Simple Regression and Correlation Analysis


A Preview of Things to Look For

1. What is meant by ordinary least squares.
2. How the regression model can minimize the sum of the squared errors.
3. The distinction between the dependent and the independent variable.
4. The fact that the values for the dependent variable are normally distributed.
5. The assumptions of the OLS model.
6. How the standard error of the estimate and the coefficient of determination
   can be used as measures of goodness-of-fit.
7. The difference between explained variation and unexplained variation.
8. The manner in which regression analysis can be used to form interval
   estimates and test hypotheses.

This chapter examines two of the most important tools of statistical analysis:
regression and correlation analysis. These highly useful techniques illustrate
the manner in which relationships between two variables can be analyzed and
used to predict future events.

Statistics show that people who snore the loudest always fall
asleep first.


INTRODUCTION

Of all the statistical techniques that you are so diligently mastering in this text, none is
more important than regression and correlation analysis. Many empirical studies rely
quite heavily on these statistical tools. They are perhaps the most commonly used
forms of statistical analysis, and are invaluable when making a large number of
business and economic decisions. Regression and correlation often prove vital in
identifying the nature of the relationship among the business and economic variables
that decision makers work with on a daily basis. It is difficult to overemphasize the
importance of regression and correlation analysis and the extent to which they can be
used to solve problems and make business decisions.

Regression and correlation analysis recognize that there may be a determinable
and quantifiable relationship between two or more variables. That is, one variable
depends on another and can be determined by it; or we can say that one variable is a
function of another. This can be stated as

$$Y = f(X) \tag{13.1}$$

which is read "Y is a function of X," and states that Y depends on X in some manner.
Since Y depends on X, it is the dependent variable and X is the independent
variable.


Regression analysis was first developed by the English scientist Sir Francis 
Galton (1822-1911). His earliest experiments with regression began with an attempt 
to analyze hereditary tendencies of sweet peas. Encouraged by the results, Sir Francis
extended his study to include the hereditary patterns in the heights of adult humans.
He found that children of parents who were unusually tall or unusually short would
tend to "regress" back toward the average height of the adult population. With this
humble introduction, the use of regression analysis has exploded into one of the most 
powerful statistical tools available. 

Determining which is the dependent variable and which is the independent
variable is crucial. This determination depends on common logic and what the
statistician is trying to investigate. For example, the dean of the College of Business is
concerned about students' GPAs and the amount of time they spend studying. Data are
gathered to examine this relationship. It is only logical to presume that more hours
spent studying will result in higher GPAs and that longer periods of study can explain
higher GPAs. Thus, GPAs are the dependent variable and study time is the independent
variable.

Consider this common example of the distinction between the dependent and the
independent variables, which is often encountered in the business world. A firm's sales
depend, at least in part, on the amount of advertising that it does. Sales is seen as the
dependent variable and is a function of the independent variable, advertising. In this
manner, advertising can be used to predict and forecast sales. The dependent variable
Y is also referred to as the regressand or the explained variable, while the independent
variable X is called the regressor or the explanatory variable. Statistical
Application 13.1 tells how one enterprising student used regression analysis for direct
monetary gain.


Statistical Application 13.1

A graduate student in the health sciences at a private university in Illinois was
asked by a local medical association to assist in estimating the demand for
hospital beds in the area. The association wished to obtain some idea as to the
number of patients who might require medical services in the future. As part of
her effort, the student formulated a regression model using total population as an
independent variable that could explain the number of hospital patients. For this
effort, the student was given $1,500, which she used to complete her graduate
studies.


Regression and correlation are actually two different but closely related concepts.
Regression is a quantitative expression of the basic nature of the relationship between
the dependent and independent variables. For example, given a simple regression
model with one independent variable, the regression model will determine if both
variables tend to move in the same direction (both increase or both decrease
simultaneously) or in opposite directions (one goes up while the other goes down). It will
also reveal the amount by which Y will change given a one-unit change in the independent
variable.

Correlation, on the other hand, determines the strength of the relationship. That 
is, while regression describes the basic nature of the relationship between the two 
variables, correlation measures how strong that relationship is. 


We should distinguish between simple and multiple regression. Simple regression
holds that the dependent variable Y is a function of only one independent variable, as
indicated in Formula (13.1). It is sometimes called bivariate analysis because only
two variables are involved: one dependent and one independent.

Multiple regression involves two or more independent variables. If Y is said to
depend on three independent variables, we can write Y = f(X₁, X₂, X₃). A model
containing k independent variables can be expressed as

$$Y = f(X_1, X_2, \ldots, X_k) \tag{13.2}$$

where X₁, X₂, ..., Xₖ are independent variables used to explain Y.

It is also important to distinguish between linear and curvilinear regression.
Linear regression attempts to depict the relationship between X and Y by a straight
line. This procedure is based on the contention that a change in X is accompanied by a
systematic change in Y, which can be represented by a line. Curvilinear regression is
used if the relationship can better be described by a curve. Statistical Application 13.2
reveals how regression can be used to "jump start" a new business.

The nature of linear regression, as well as the manner in which it differs from
curvilinear regression, can perhaps be best illustrated by the scatter diagrams in Figure
13-1. Scatter diagrams plot the paired observations of X and Y on a graph. Customarily,
the dependent variable is placed on the vertical axis, while the independent
variable is on the horizontal axis. Assume the director of marketing for a large retail
chain collects data on sales levels and advertising expenditures for her company over
the last several months. Her intent is to use advertising expenditures to explain the
levels of sales for her firm. If these data were plotted in a scatter diagram using sales
as the dependent variable, several possible patterns might emerge. For example, the
pattern in Figure 13-1(a) suggests that as advertising increases, sales go up. This is the
logical assumption that the director might make before collecting her data, and
indicates that a positive, or direct, linear relationship exists between advertising and
sales. However, after plotting the data, the director might find, to her surprise, an
inverse, or indirect, linear relationship as shown by Figure 13-1(b). Here the data
suggest that as advertising expenditures are increased, sales actually go down. This
would suggest, of course, that funds spent on advertising might be used more
effectively elsewhere.


Statistical Application 13.2

Advertising Weekly contained an account of how a small electronics firm in the
Silicon Valley south of San Francisco used regression analysis to predict future
sales levels of their fledgling firm. Data were collected for the number of new
electronics firms in the area, an index of the general level of activity in the
electronics industry, and measures of the health of the national economy. With the
use of multiple regression, these three variables were used to devise estimates of
future sales levels.


On the other hand, it is difficult to observe any distinct relationship between
advertising and sales in the scatter diagram in Figure 13-1(c). The pattern here
suggests that no relationship exists between the two variables.

Figures 13-1(d) and 13-1(e) indicate curvilinear relationships. Notice that the 
patterns in the scatter diagrams seem to take on a nonlinear or curved shape. 

The objective of regression analysis is to develop a line that passes through the 
scatter diagram and best represents the data points. Our interest throughout most of 
this chapter will focus primarily on simple linear regression. 


Linear and Curvilinear Relationships: If X and Y are related in a linear manner, then as X
changes, Y changes by a constant amount.

[Figure 13-1: Linear and Curvilinear Relationships. Scatter diagrams (a) through (e), with
sales on the vertical axis and advertising on the horizontal axis.]


THE MECHANICS OF A STRAIGHT LINE

Since we are going to explain the relationship between X and Y on the basis of a
straight line, we need to review a few facts about the formula for a straight line. A
straight line can be expressed by the formula

$$Y = b_0 + b_1 X \tag{13.3}$$

where b₀ is the vertical intercept and b₁ is the slope of the line. If we were to find, for
example, that b₀ = 5 and b₁ = 2, we would have

$$Y = 5 + 2X \tag{13.4}$$

Figure 13-2 shows that the vertical intercept is 5. It represents the value of Y
when X = 0. The slope of a line indicates the change in Y that occurs for every 1-unit
change in X. It is found as the rise divided by the run. The rise is the vertical change, or
change in the Y-variable since Y is measured on the vertical axis. The run is a one-unit
change in the value of the X-variable measured on the horizontal axis.


[Figure 13-2: A Linear Relationship]

Given Formula (13.4), if X = 10, Y would equal 25. If X increases by 1 unit to 11,
Y becomes 27. That is, Y goes up by 2. The slope is then

$$b_1 = \text{Slope} = \frac{\text{Vertical change}}{\text{Horizontal change}} = \frac{2}{1} = 2$$

It can be seen that for every 1-unit increase in X, Y goes up by 2 units.

The slope can be positive, as here, or negative if b₁ < 0. If the slope is negative,
the line is downward-sloping, as in Figure 13-3(a), indicating that X and Y move in
opposite directions. If b₁ is 0, the line has a slope of zero and is horizontal. This
suggests that there is no relationship between X and Y, since Y remains constant even
if X changes. Figure 13-3(b) shows that as X changes from X₁ to X₂, the value of Y
does not change.

[Figure 13-3: Other Potential Linear Relationships. Panel (a): a downward-sloping line;
panel (b): a horizontal line.]


Relationships between variables are either deterministic or stochastic (random). An
example of a deterministic relationship can be expressed by a mathematical model, or
formula, that converts speed in miles per hour (mph) into kilometers per hour (kph).
Since 1 mile equals approximately 1.6 kilometers, this model is 1 mph = 1.6 kph.
Thus, a speed of 5 mph = 5(1.6) kph = 8.0 kph. This is a deterministic model because
there is no error (except for rounding) in the determination of the rate of speed in kph.
Given any value for mph, we can determine kph exactly. When mph equals 5, kph is
always 8.

Unfortunately, few relationships in the business world are so exact or so easily
determined. In using advertising to determine sales, for example, there is almost
always some variation in the relationship. When advertising is some given amount, X,
sales will take on some value. However, the next time advertising equals that same
amount X, sales could very well be some other value. The dependent variable (sales)
demonstrates some degree of randomness. A model of this nature is said to be
stochastic, due to the presence of random variation. A model that reflects that variation
is

$$Y = \underbrace{\beta_0 + \beta_1 X}_{\text{deterministic component}} + \underbrace{\epsilon}_{\text{random component}} \tag{13.5}$$


Formula (13.5) represents the population (or true) relationship in which we
regress Y on X. Since Formula (13.5) pertains to a population, β₀ and β₁ are
parameters. The vertical intercept of the line used to reflect the relationship at the
population level between X and Y is represented by β₀, while β₁ is the slope. The
value ε (Greek letter epsilon) is a random error term designed to capture variation
above and below the regression line due to all other factors not included in the model.
For example, in addition to advertising, sales are probably also influenced by the level
of competition, location, relative prices, and other factors. The random component ε
may be positive or negative, depending on whether a value of Y, given any X value,
lies above or below the regression line. It is also called the disturbance term since it
"disturbs" the otherwise deterministic relationship between X and Y.
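
The role of the random component in Formula (13.5) can be illustrated with a short simulation. This is only a sketch: the parameter values below are invented for illustration (the true population parameters are never known), and drawing the error term from a normal distribution is an assumption made here for convenience.

```python
import random

def simulate_y(x, beta0, beta1, sigma, rng):
    """Return one draw of Y = beta0 + beta1*x + epsilon, with epsilon ~ Normal(0, sigma)."""
    epsilon = rng.gauss(0.0, sigma)  # random component (disturbance term)
    return beta0 + beta1 * x + epsilon

# Hypothetical population parameters, chosen only for illustration.
BETA0, BETA1, SIGMA = 100.0, 6.0, 15.0

rng = random.Random(42)  # fixed seed so the run is repeatable
x = 50
deterministic_part = BETA0 + BETA1 * x
draws = [simulate_y(x, BETA0, BETA1, SIGMA, rng) for _ in range(5)]

print("deterministic component:", deterministic_part)
for y in draws:
    # The same X produces a different Y each time; the gap is epsilon.
    print(f"Y = {y:8.2f}   epsilon = {y - deterministic_part:+7.2f}")
```

Each draw uses the same X, yet Y differs from draw to draw, which is exactly what separates a stochastic model from a deterministic one.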


To illustrate, assume that a computer manufacturer wishes to examine the relationship
between the number of hard-disk drives produced and the total cost. The
firm's head financial analyst and statistician collects data over a five-day period for
the number of drives produced and the corresponding costs. Although a sample of
only five observations is most likely insufficient, it will serve the purposes of
illustration. The data are displayed in Table 13-1. These data are then plotted in a
scatter diagram shown in Figure 13-4. If a line is drawn through the middle of the
scatter, some observations fall above it while others fall below it. Few, if any,
relationships in the real world are perfectly linear. Therefore, not all the observations
will fall directly on the regression line. There will likely be some variation above and
below it. This deviation above and below the line is reflected in Formula (13.5) by ε.


TABLE 13-1  Production Data for Computer Hardware

Day    Number of Drives    Cost
 1            50           $450
 2            40            380
 3            65            540
 4            55            500
 5            45            420


[Figure 13-4: A Scatter Diagram for Production Data. Horizontal axis: disk drives, 40 to 70;
vertical axis: cost.]


It should be emphasized that the true population regression line will, like most
parameters, remain unknown. The best that we can do is estimate it by using the
sample model illustrated by Formula (13.6A).

$$Y = b_0 + b_1 X + e \tag{13.6A}$$

The values b₀ and b₁ are estimates for the population parameters β₀ and β₁. They are
called, respectively, the regression constant and the regression coefficient. The last
term, e, is the error component, which is necessary because not all observations for X
and Y fall exactly on a straight line. Since some of the observations fall above the line
and others fall below it, e is a random variable. However, it is assumed, as emphasized
later in this chapter, that the error term will have a mean value of zero and a variance
of some amount we will call σ².

The model expressed by Formula (13.6A) is then used to estimate the relationship
between X and Y, resulting in the regression line

$$\hat{Y} = b_0 + b_1 X \tag{13.6B}$$

where Ŷ (pronounced "Y-hat") is the estimated value for the dependent variable and
is represented by a point on the regression line.


To gain a better understanding of regression analysis, let us return to our scatter 
diagram for data on the computer disk drives (Figure 13-4). We wish to determine a 
line through the middle of the scatter that best defines or represents these data points. 
This regression line must depict the relationship between the dependent and indepen- 
dent variables with accuracy and precision, and must fit these data points better than 
any other line we might be able to draw. We are looking for the line of best fit. It is not 
possible to simply eyeball the exact placement of the line, as might have been 
suggested above. 
Notice in Figure 13-5 that all the lines appear to fit the dispersion of the data
portrayed by the scatter diagram. Each line seems a likely choice for the line of best
fit. It is impossible to determine by mere inspection which line actually provides the 
most exact measure of the relationship between X and Y. This illustrates the need for a 
much more accurate procedure of determining our line of best fit. 


[Figure 13-5: Possible Lines of Fit]


This more precise method is called ordinary least squares (OLS). To aid in the
illustration of OLS, the scatter diagram for the computer disk drives is repeated in
Figure 13-6.

[Figure 13-6: Ordinary Least Squares. Scatter diagram of the disk drive data marking an
observed value Yᵢ for Y when X = 55, the estimated value Ŷ when X = 55, the error between
them, and an observed value for Y when X = 40.]

It is called OLS because it results in a line which minimizes the sum of the squared vertical
distances from each observation point to the line itself. To understand the meaning of
OLS, you must remember that, as illustrated in Figure 13-6, Yᵢ is an actual, observed
value for the Y-variable, and Ŷᵢ is a value on the line predicted by the equation. We
then calculate the vertical difference between Yᵢ and Ŷᵢ, that is, Yᵢ - Ŷᵢ. That difference is
then squared to yield (Yᵢ - Ŷᵢ)². This is done for all five values of Y. All five squared
differences are summed and expressed as

$$\sum (Y_i - \hat{Y}_i)^2 = \min \tag{13.7}$$


where min is a number smaller than what you would get if you summed these squared
vertical deviations between the actual data points and any other line. Hence, the term
least squares is used.

The difference Yᵢ - Ŷᵢ is called the residual, or error. It is the difference between
what Y actually is, Yᵢ, and what we predicted it to be using our regression model, Ŷᵢ.
OLS will minimize the sum of the squared errors.
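
The least-squares property can be checked numerically on the five disk drive observations from Table 13-1. The sketch below assumes the standard closed-form expressions for the least-squares slope and intercept; the two "eyeballed" candidate lines are hypothetical and were picked only because they also pass plausibly through the scatter.

```python
# Data from Table 13-1: disk drives produced (X) and total cost in dollars (Y).
drives = [50, 40, 65, 55, 45]
cost = [450, 380, 540, 500, 420]

def sse(b0, b1, xs, ys):
    """Sum of squared vertical distances between each Y and the line b0 + b1*X."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))

def ols_line(xs, ys):
    """Closed-form least-squares intercept and slope for a single predictor."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    ss_x = sum((x - x_bar) ** 2 for x in xs)
    ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = ss_xy / ss_x
    b0 = y_bar - b1 * x_bar
    return b0, b1

b0, b1 = ols_line(drives, cost)
sse_ols = sse(b0, b1, drives, cost)

# Two hypothetical lines someone might draw by eye through the same scatter.
candidates = {"100 + 7X": (100.0, 7.0), "150 + 6X": (150.0, 6.0)}

print(f"OLS line: {b0:.3f} + {b1:.3f}X, SSE = {sse_ols:.2f}")
for label, (c0, c1) in candidates.items():
    print(f"Candidate {label}: SSE = {sse(c0, c1, drives, cost):.2f}")
```

Both candidate lines look reasonable on a plot, yet each leaves a larger sum of squared errors than the least-squares line, which is why inspection alone cannot settle the choice.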


It can be shown by means of differential calculus that this sum of the squared
errors will indeed be minimized by calculating the sums of squares and cross-products.
The proof of this statement, which is omitted here, can be found in calculus
books or more advanced statistics texts.
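
For readers curious about the shape of that argument, its outline is short. What follows is a sketch of the standard derivation, not a reproduction of any particular proof; S is a symbol introduced here for the sum of squared errors, and b₀ and b₁ are the intercept and slope of the fitted line.

$$
\begin{aligned}
S(b_0, b_1) &= \sum (Y_i - b_0 - b_1 X_i)^2 \\
\frac{\partial S}{\partial b_0} &= -2 \sum (Y_i - b_0 - b_1 X_i) = 0 \\
\frac{\partial S}{\partial b_1} &= -2 \sum X_i (Y_i - b_0 - b_1 X_i) = 0
\end{aligned}
$$

Rearranging the two conditions gives the pair of normal equations

$$\sum Y = n b_0 + b_1 \sum X, \qquad \sum XY = b_0 \sum X + b_1 \sum X^2$$

and solving them simultaneously yields

$$b_0 = \bar{Y} - b_1 \bar{X}, \qquad b_1 = \frac{\sum XY - (\sum X)(\sum Y)/n}{\sum X^2 - (\sum X)^2/n}$$

Because S is a sum of squares, this stationary point is a minimum rather than a maximum whenever the X-values are not all identical.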
We can calculate the sums of squares of X (SSx), the sums of squares of Y (SSy),
and the sums of the cross-products (SSxy) as

$$\text{SSx} = \sum (X_i - \bar{X})^2 = \sum X^2 - \frac{(\sum X)^2}{n} \tag{13.8}$$

$$\text{SSy} = \sum (Y_i - \bar{Y})^2 = \sum Y^2 - \frac{(\sum Y)^2}{n} \tag{13.9}$$

and

$$\text{SSxy} = \sum (X_i - \bar{X})(Y_i - \bar{Y}) = \sum XY - \frac{(\sum X)(\sum Y)}{n} \tag{13.10}$$


Notice that the first portions of each of these formulas,

$$\text{SSx} = \sum (X_i - \bar{X})^2$$

$$\text{SSy} = \sum (Y_i - \bar{Y})^2$$

and

$$\text{SSxy} = \sum (X_i - \bar{X})(Y_i - \bar{Y})$$

illustrate how the OLS line is indeed based on the deviations of the observations from
their mean. SSx, for example, is found by (1) calculating the amount by which each of
the observations for X (Xᵢ) deviates from their mean (X̄), (2) squaring those deviations,
and (3) summing those squared deviations. However, these computations are quite
tedious when done by hand. We will therefore generally use the second portion of
each of these formulas in our calculations.

Given the sums of squares and cross-products, it is then a simple matter to
calculate the regression coefficient and the intercept as

$$b_1 = \frac{\text{SSxy}}{\text{SSx}} \tag{13.11}$$

and

$$b_0 = \bar{Y} - b_1 \bar{X} \tag{13.12}$$


A word of caution: These calculations are extremely sensitive to rounding. This 
is particularly true for the calculation of the coefficient of determination, which is 
demonstrated later in this chapter. You are therefore advised, in the interest of
accuracy, to carry out your calculations to five or six decimal places.
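
Both points, the equivalence of the two forms of each formula and the sensitivity to rounding, can be seen on the disk drive data of Table 13-1. The following is a minimal sketch; the function names are illustrative only, and rounding the slope to one decimal place is an arbitrary choice made here to exaggerate the effect.

```python
# Data from Table 13-1: disk drives produced (X) and total cost in dollars (Y).
drives = [50, 40, 65, 55, 45]
cost = [450, 380, 540, 500, 420]

def sums_definitional(xs, ys):
    """SSx, SSy, SSxy from deviations about the means (first form of each formula)."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    ss_x = sum((x - x_bar) ** 2 for x in xs)
    ss_y = sum((y - y_bar) ** 2 for y in ys)
    ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return ss_x, ss_y, ss_xy

def sums_computational(xs, ys):
    """SSx, SSy, SSxy from raw sums (second, hand-calculation form)."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    ss_x = sum(x * x for x in xs) - sum_x ** 2 / n
    ss_y = sum(y * y for y in ys) - sum_y ** 2 / n
    ss_xy = sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y / n
    return ss_x, ss_y, ss_xy

ss_x, ss_y, ss_xy = sums_computational(drives, cost)
x_bar = sum(drives) / len(drives)
y_bar = sum(cost) / len(cost)

b1 = ss_xy / ss_x                          # regression coefficient
b0 = y_bar - b1 * x_bar                    # intercept from the unrounded slope
b0_rounded = y_bar - round(b1, 1) * x_bar  # intercept after rounding the slope first

print("definitional :", sums_definitional(drives, cost))
print("computational:", (ss_x, ss_y, ss_xy))
print(f"b1 = {b1:.5f}, b0 = {b0:.5f}, b0 with rounded slope = {b0_rounded:.5f}")
```

Rounding the slope before computing the intercept shifts the intercept noticeably, which is the practical reason for carrying extra decimal places through the intermediate steps.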


AN EXAMPLE USING OLS

The management of Hop Scotch Airlines, the world's smallest air carrier, assumes a
relationship between advertising expenditures and the number of passengers
who choose to fly Hop Scotch. To determine if this relationship does exist and, if so,
what its exact nature might be, the statisticians employed by Hop Scotch set out to use
OLS procedures to determine the regression model.

Monthly values for advertising expenditures and numbers of passengers are
collected for the n = 15 most recent months. The data are shown in Table 13-2, along
with other calculations necessary to compute the regression model. You will note that
passengers is labeled as the Y-variable since it is assumed to depend on advertising.

TABLE 13-2  Regression Data for Hop Scotch Airlines

Observation    Advertising            Passengers
(months)       (in $1,000's) (X)      (in 1,000's) (Y)       XY       X²       Y²
     1               10                     15               150      100      225
     2               12                     17               204      144      289
     3                8                     13               104       64      169
     4               17                     23               391      289      529
     5               10                     16               160      100      256
     6               15                     21               315      225      441
     7               10                     14               140      100      196
     8               14                     20               280      196      400
     9               19                     24               456      361      576
    10               10                     17               170      100      289
    11               11                     16               176      121      256
    12               13                     18               234      169      324
    13               16                     23               368      256      529
    14               10                     15               150      100      225
    15               12                     16               192      144      256
 Total              187                    268             3,490    2,469    4,960

With this simple set of data, and the subsequent computations for XY, X², and Y²,
it is an easy task to determine the regression model by calculating values for the
regression constant and the regression coefficient in the regression line Ŷ = b₀ + b₁X.
It is first necessary to compute the sums of squares and cross-products.

$$\text{SSx} = \sum X^2 - \frac{(\sum X)^2}{n} = 2{,}469 - \frac{(187)^2}{15} = 137.733333$$

$$\text{SSy} = \sum Y^2 - \frac{(\sum Y)^2}{n} = 4{,}960 - \frac{(268)^2}{15} = 171.733333$$

$$\text{SSxy} = \sum XY - \frac{(\sum X)(\sum Y)}{n} = 3{,}490 - \frac{(187)(268)}{15} = 148.933333$$

Example 13.1 uses these values to demonstrate the regression procedure and the
manner in which the regression line is calculated.
EXAMPLE 13.1  Regression Model for Hop Scotch Airlines

In order to make decisions regarding allocations for the advertising budget, the
accounting department for Hop Scotch Airlines must determine the nature of the
relationship between advertising expenditures and the number of passengers. The
senior accountant recognizes that regression analysis would be of invaluable assistance.

SOLUTION: From Formulas (13.11) and (13.12), the values for b₁ and b₀ can be
determined as

$$b_1 = \frac{\text{SSxy}}{\text{SSx}} = \frac{148.933333}{137.733333} = 1.0813166 \text{ or } 1.08$$

$$b_0 = \bar{Y} - b_1 \bar{X} = 17.86667 - (1.0813166)(12.46667) = 4.38625 \text{ or } 4.4$$

The regression equation is therefore

$$\hat{Y} = 4.40 + 1.08X$$

INTERPRETATION: The model tells us that if, for example, $10,000 is spent on advertising
(X = 10), then

$$\hat{Y} = 4.40 + 1.08(10) = 15.2$$

By multiplying 15.2 by 1,000, since the Y-values were originally expressed in thousands,
we predict on the basis of our model that 15,200 brave souls will choose to fly
Hop Scotch when $10,000 is spent on advertising.


Remember the meaning of b₁, the regression coefficient: It indicates by how
much Y will change for every one-unit change in the X-variable. In our case, since
b₁ = 1.08, for every additional $1,000 (which is one unit since X is measured in
thousands) that Hop Scotch spends on advertising, 1,080 more passengers will choose
the friendly skies of Hop Scotch. Again, this value of 1,080 requires that we multiply
the b₁ coefficient by 1,000, since passengers was also expressed in thousands. This is
displayed in Example 13.2.


EXAMPLE 13.2

The director of advertising for Hop Scotch wishes to determine how a change in the
amount spent on advertising will affect the number of passengers. Hop Scotch is
currently spending $10,000 on advertising and is considering spending an additional
$1,000. The final decision depends on the passenger response predicted by the
advertising department if this additional $1,000 is spent.

SOLUTION: Given that

$$\hat{Y} = 4.40 + 1.08X$$

the expenditure of $10,000 (X = 10) is associated with 15,200 passengers. That is,

$$\hat{Y} = 4.40 + 1.08(10) = 15.2$$

If advertising is increased by one unit to $11,000, the estimate of total passengers
becomes

$$\hat{Y} = 4.40 + 1.08(11) = 16.28 \text{ or } 16{,}280 \text{ passengers}$$

INTERPRETATION: If X is increased from 10 to 11, the predicted number of passengers is
16,280. This is exactly 1,080 more than the 15,200 passengers predicted to fly if X =
10. Such information is useful in determining if an increase in the advertising budget
is justified.


As you can see, the regression model can be used to predict or forecast the value
for the dependent variable. Given any amount in Hop Scotch's advertising budget, we
can easily determine an estimate for the number of passengers who fly Hop Scotch.

Another word of caution: This is not to imply that an increase in X causes an
increase in Y. Although it may appear that additional advertising caused more people
to purchase tickets on Hop Scotch, we can only conclude that X and Y move together.
There is no evidence of a cause-and-effect relationship. The simultaneous increase in
X and Y may have been caused by an unknown third variable excluded from the study.
It is a common misconception to assume that there exists a cause-and-effect relationship
between the two variables. We will deal with this matter again later in this
chapter.

As you might well imagine, as the number of observations increases, it becomes
quite difficult to calculate the regression model by hand, and using a computer
becomes essential. Display 13-1 provides a portion of the output from a computer run
using SPSS-PC to derive the regression model for Hop Scotch. Notice that the
intercept (constant) and slope coefficients are found under the column headed by B to
be 4.38625 and 1.08132. The other statistics reported in the printout will be discussed
throughout this chapter.


DISPLAY 13-1  Coefficients for Hop Scotch

* * * *   M U L T I P L E   R E G R E S S I O N   * * * *

Equation Number 1     Dependent Variable..   PASS
Variable(s) Entered on Step Number   1..     ADV

Multiple R            .96838
R Square              .93776
Adjusted R Square     .93297
Standard Error        .90678

------------------ Variables in the Equation ------------------

Variable             B        SE B        Beta          T     Sig T

ADV            1.08132      .07726      .96838     13.995     .0000
(Constant)     4.38625      .99128                  4.425     .0007

End Block Number   1   All requested variables entered.
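
The figures in Display 13-1 can be cross-checked with a few lines of Python. This is a sketch rather than a reproduction of the SPSS-PC run: the data are the fifteen monthly observations as read from Table 13-2, and R Square and the standard error are computed from their usual definitions (one minus the ratio of unexplained to total variation, and the square root of the sum of squared errors divided by n - 2), which the chapter develops only later.

```python
import math

# Table 13-2: advertising (X, in $1,000's) and passengers (Y, in 1,000's).
advertising = [10, 12, 8, 17, 10, 15, 10, 14, 19, 10, 11, 13, 16, 10, 12]
passengers = [15, 17, 13, 23, 16, 21, 14, 20, 24, 17, 16, 18, 23, 15, 16]

n = len(advertising)
sum_x, sum_y = sum(advertising), sum(passengers)

# Sums of squares and cross-products, computational form.
ss_x = sum(x * x for x in advertising) - sum_x ** 2 / n
ss_y = sum(y * y for y in passengers) - sum_y ** 2 / n
ss_xy = sum(x * y for x, y in zip(advertising, passengers)) - sum_x * sum_y / n

b1 = ss_xy / ss_x                  # slope (B for ADV)
b0 = sum_y / n - b1 * sum_x / n    # intercept (B for Constant)

# Residuals and the two goodness-of-fit figures reported in the display.
residuals = [y - (b0 + b1 * x) for x, y in zip(advertising, passengers)]
sse = sum(e * e for e in residuals)
r_square = 1 - sse / ss_y
std_error = math.sqrt(sse / (n - 2))

print(f"b1 = {b1:.5f}   b0 = {b0:.5f}")
print(f"R Square = {r_square:.5f}   Standard Error = {std_error:.5f}")
print(f"Predicted passengers (1,000's) at X = 10: {b0 + b1 * 10:.3f}")
```

Working with unrounded coefficients gives a prediction at X = 10 that differs slightly from the 15.2 obtained with the rounded equation, a small instance of the rounding caution given earlier.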
Let us continue with our discussion of the problem Hop Scotch is having in
determining the exact nature of the relationship between passengers and advertising.
The chief executive officer for Hop Scotch is concerned not only with how much
money to budget for advertising purposes, but also with how to allocate
those funds. The airline often buys advertising space in the magazines Executive
Weekly and Fisherman's Delight. Previous market research has shown that people
who read Executive Weekly fall into substantially higher income brackets than those
who subscribe to Fisherman's Delight. The CEO is curious about the impact that a
person's income has on the frequency of flying. Data are gathered for 10 passengers
on their annual income levels, measured in thousands of dollars, and the number of
flights they took during the most recent 12-month period. Since the CEO is concerned
about the effect of income on flights, income takes on the role of the independent
variable, and the number of flights is seen as the dependent variable. The values
appear in Table 13-3. Given these data, the CEO wishes to determine the nature of the
relationship between income and the tendency of people to use air service in their
travel plans. He feels that this information would be useful in making a variety of
decisions regarding advertising efforts and marketing policies.


TABLE 13-3  Data on Income and Flights for Hop Scotch Airlines

Passenger    Flights (Y)    Income (X)       XY         X²
    1             5             30           150        900
    2             4             27           108        729
    3             7             38           266      1,444
    4            10             48           480      2,304
    5            11             59           649      3,481
    6             8             54           432      2,916
    7             9             42           378      1,764
    8            11             63           693      3,969
    9             8             52           416      2,704
   10             9             47           423      2,209
 Total           82            460         3,995     22,420
EXAMPLE  The Relationship between Income and Flights

The CEO feels that the number of flights people take depends in some manner upon
their income. Flights is therefore seen as the dependent variable, while income
assumes the role of the independent variable. Simple regression analysis can provide
the CEO with the precise knowledge he desires regarding the relationship between
these two variables.

SOLUTION: The sums of squares and cross-products are

$$\text{SSx} = \sum X^2 - \frac{(\sum X)^2}{n} = 22{,}420 - \frac{(460)^2}{10} = 1{,}260$$

$$\text{SSxy} = \sum XY - \frac{(\sum X)(\sum Y)}{n} = 3{,}995 - \frac{(460)(82)}{10} = 223$$

Then

$$b_1 = \frac{\text{SSxy}}{\text{SSx}} = \frac{223}{1{,}260} = 0.177$$

$$b_0 = \bar{Y} - b_1 \bar{X} = 8.2 - (0.177)(46) = 0.058$$

The regression equation is

$$\hat{Y} = 0.058 + 0.177X$$

INTERPRETATION: The regression coefficient of 0.177 tells the CEO that there is a positive
relationship between income and the frequency of flying. As income goes up, the
number of flights will increase. It is further apparent that a one-unit increase in income
of $1,000 is associated with a 0.177 increase in the number of flights. For each
additional $1,000 in income, the passengers will book 0.177 more flights. The CEO's
decision becomes clear. If he wishes to get the most for his advertising dollar, he
should place his advertisements in Executive Weekly since, as noted above, people
with higher incomes tend to read it more often.


To further illustrate the power of regression analysis, consider the plight of the
director of personnel for Hop Scotch. In the belief that a physical fitness program for
the employees at Hop Scotch will reduce absenteeism due to illness, the director has
implemented an exercise program for all interested workers. However, she has had a
difficult time proving its worth to higher management. They seem to feel that it is just
a waste of time and money and has done little to improve employee performance. In
an effort to prove that the exercise program does indeed reduce the number of days
employees miss work due to illness, the director examines the records of 50
employees. She obtains figures on the number of hours each employee has been
attending the exercise programs and the number of sick days for each of these
workers.

The director's assertion is that illness will be reduced as exercise increases. That
is, exercise impacts on illness. Therefore, the number of sick days is the dependent
variable. The employee records provide these data:

n = 50          ΣXY = 1,080
ΣX = 180        ΣX² = 5,340
ΣY = 450

With this information the director is able to use regression analysis to examine the
relationship between illness and exercise.


Given that any linear relationship that may exist between hours in the exercise 
ia program and sick days can be identified through regression analysis, the director of 
personnel calculates and interprets the regression model. 


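SOLUTION: Compute the constant term and the regression coefficient:

\[ SS_x = \sum X^2 - \frac{(\sum X)^2}{n} = 5,340 - \frac{(180)^2}{50} = 4,692 \]

\[ SS_{xy} = \sum XY - \frac{(\sum X)(\sum Y)}{n} = 1,080 - \frac{(180)(450)}{50} = -540 \]

Therefore,

\[ b_1 = \frac{SS_{xy}}{SS_x} = \frac{-540}{4,692} = -0.1151 \]

\[ b_0 = \bar{Y} - b_1 \bar{X} = 9 - (-0.1151)(3.6) = 9.4 \]

The regression model is therefore

\[ \hat{Y} = 9.4 - 0.1151X \]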


INTERPRETATION: It would appear that the personnel director and her exercise program 
are vindicated. The negative sign on the regression coefficient testifies to the claim 
that as the employees spend more time in the exercise program, the number of days 
lost to illness decreases. Specifically, if one additional hour is devoted to physical 
exercise, the number of sick days will decrease by 0.1151 days. 
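
The arithmetic in this example can be checked with a short Python sketch. The
variable names are illustrative and do not come from the text; only the five
summary totals are taken from the employee records. Carried to more decimal
places, the intercept works out to about 9.41.

```python
# Summary totals from the 50 employee records
# (X = hours in the exercise program, Y = sick days).
n = 50
sum_x, sum_y = 180, 450
sum_xy, sum_x2 = 1_080, 5_340

# Sums of squares and cross-products.
ss_x = sum_x2 - sum_x ** 2 / n          # 5,340 - 648
ss_xy = sum_xy - sum_x * sum_y / n      # 1,080 - 1,620

# Slope and intercept of the least-squares line.
b1 = ss_xy / ss_x
b0 = sum_y / n - b1 * sum_x / n

print(f"SSx = {ss_x:.0f}, SSxy = {ss_xy:.0f}")
print(f"b1 = {b1:.4f}, b0 = {b0:.2f}")
```

A negative value for b1 is what supports the director's claim: more hours of
exercise go with fewer sick days.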


Quick Check

13.5.1 Given the seven values for Y and X shown here,
a. Determine the regression model.
b. What happens to Y if X increases by 1 unit?
c. What is Y if X = 0?

Y = 9, 10, 8.2, 9.5, 8, 7.2, 7
X = 8, 12, 10, 11, 8, 7, 6.5

Answer:
a. Ŷ = 4.15 + 0.477X
b. Y will increase by 0.477 units.
c. 4.15

13.5.2 Given the six values for Y and X shown here, answer the same three questions as in
13.5.1.

Y = 1, 2.5, 4, 3, 3.2, 5.2
X = 5, 3, 2.5, 3, 3.2, 3

Answer:
a. Ŷ = 7.215 - 1.238X
b. Y will decrease by 1.238 units.
c. 7.215

13.5.3 Given the seven values for Y and X shown here,
a. Determine the regression model.
b. What value would you predict for Y if X = 4?
c. When X was 4, Y was 8. Why is there a difference, and what is it called?

Y = 9, 8, 5.3, 5, 6.1, 7.1, 8
X = 3, 1.5, 1, 1, 2.5, 3, 4

Answer:
a. Ŷ = 4.9126 + 0.88198X
b. If X = 4, Ŷ = 4.9126 + 0.88198(4) = 8.44.
c. The difference occurs due to the randomness of the stochastic model and is
called the residual or error.
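
As a check on Quick Check 13.5.3, the following Python sketch fits the line from
the raw data and computes the residual at X = 4. The helper name `fit_line` is
an illustrative choice, not something defined in the text.

```python
def fit_line(x, y):
    """Return (b0, b1) for the least-squares line through the points."""
    n = len(x)
    ss_x = sum(v * v for v in x) - sum(x) ** 2 / n
    ss_xy = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n
    b1 = ss_xy / ss_x
    b0 = sum(y) / n - b1 * sum(x) / n
    return b0, b1

# Data from Quick Check 13.5.3.
y = [9, 8, 5.3, 5, 6.1, 7.1, 8]
x = [3, 1.5, 1, 1, 2.5, 3, 4]

b0, b1 = fit_line(x, y)
y_hat = b0 + b1 * 4          # predicted Y when X = 4
residual = 8 - y_hat         # actual value minus predicted value

print(f"Y-hat = {b0:.4f} + {b1:.5f} X")
print(f"prediction at X = 4: {y_hat:.2f}, residual: {residual:.2f}")
```

The residual is negative here because the model overestimates the actual value
of 8.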


The Y-Values Are Assumed to Be Normally Distributed

By examining the data in Table 13-2, we can arrive at a very logical and reasonable
conclusion. Specifically, given the same value for X on several occasions, we can see
that it is associated with different Y-values each time. In months 1, 5, 7, 10, and 14,
Hop Scotch spent $10,000 for advertising. However, in each case the corresponding
number of passengers varied. Even though the same amount was spent on advertising
each time, the number of passengers on each occasion was 15,000, 16,000, 14,000,
17,000, and 15,000. Is this a reasonable expectation? Of course it is. If Hop Scotch
repeatedly spent the same amount for advertising, there is absolutely no reason to
expect the number of passengers to be exactly the same each time. The value of the
dependent variable Y will vary even if the value for X remains fixed. Since Y is
different almost every time, the best our regression model can do is estimate the
average value for Y given any X-value. Regression analysis is based on the assumption
that a linear relationship exists between X and the mean value of Y, E(Y). The
regression line can be written E(Y) = β0 + β1X. A point on the line denotes the
average value for Y given any X-value. For this reason, the regression line is often
referred to as a mean line.

The point to remember is that for any value of X that may occur several times, we
could very well get a different Y-value each time. In fact, an entire distribution of
different Y-values will result. Regression analysis assumes that this distribution of
Y-values is normal. Thus, if X was set equal to 10 many times, many different values
of Y would result. If these values were graphed, they would appear as a normal
distribution. This distribution is centered at the mean of these Y-values, as illustrated
in Figure 13-7.

[Figure 13-7: The Normal Distribution of Y-Values for a Given Single Value of X
(X = 10). Horizontal axis: Passengers (Y); vertical axis: Frequency of Y-values.]

[Figure 13-8: The Normal Distribution of Y-Values Where X = 11.]

We can calculate this mean value of Y using our regression equation. If X is set
equal to 10, our estimate of the number of passengers is

\[ \hat{Y} = b_0 + b_1 X = 4.4 + 1.08(10) = 15.2 \]

or 15,200 since the data for passengers were expressed in thousands.

This suggests that if the airline spends $10,000 each month on advertising for 
several months, the number of passengers, although perhaps different each month, will 
average 15,200. 
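
The idea that repeated months with the same advertising outlay average out to
the value on the regression line can be illustrated with a small simulation.
This is only a sketch: the spread `sigma` of 0.9 (thousand passengers) is an
assumed figure chosen for illustration, and the normal draws simply encode the
normality assumption described here.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

b0, b1 = 4.4, 1.08      # Hop Scotch regression line from the text
sigma = 0.9             # ASSUMED spread of passengers around the line (thousands)
x = 10                  # advertising held at $10,000 every month

# Simulate many months with the same X; only the random error differs.
passengers = [b0 + b1 * x + random.gauss(0, sigma) for _ in range(10_000)]
average = sum(passengers) / len(passengers)

print(f"value on the regression line: {b0 + b1 * x:.1f}")
print(f"average of simulated months:  {average:.2f}")
```

Individual simulated months differ from one another, yet their average settles
close to 15.2, the point on the mean line.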

This normal distribution of Y-values exists for all values of X. Thus, if X = 11 on
many separate occasions, there would occur an entire distribution of Y-values that
would be normally distributed and centered at

\[ \hat{Y} = 4.40 + 1.08(11) = 16.28 \]

or 16,280 passengers. The distribution of Y-values that results when X = 11 on many
occasions is seen in Figure 13-8, where it is centered at Ŷ = 16.28 on the
Passengers (Y) axis.


When estimating the true, but unknown, population regression line
E(Y) = β0 + β1X with the sample regression line Ŷ = b0 + b1X, we are trying to find
that line that passes through the means of the various distributions of Y-values for
each X-value. This is shown in Figure 13-9. However, we must turn the distributions of
Y-values on their sides since Y is measured on the vertical axis in Figure 13-9. Notice
that for each value of X there is a distribution of Y-values. This is true where X = 10,
X = 11, X = 12, or any other value. The regression line then passes through the mean
of each of those distributions.

Each distribution of Y-values is normal and, like any distribution of numbers, has
a variance of σ² and a standard deviation of σ. The important point to note here is that
this variance is assumed to be the same for each distribution of Y-values regardless of
the X-value. That is, the variance of Y-values when X = 10 is the same as the variance
of Y-values when X = 11 (or anything else).


[Figure 13-9: The Normal Distributions of Y-Values for the Various Values of X.
Horizontal axis: Advertising.]


As you can see, the basic mechanics of regression analysis are fairly simple. We need
only recognize that we are describing the relationship between X and Y on the basis of
a straight line. We calculate values for the intercept and the slope of the regression line
using Formulas (13.11) and (13.12). The resulting model can then be used to predict
values of the dependent variable given any value for the independent variable.

However, there is really more than this to regression analysis. To gain a more
complete picture of the nature of this important and useful statistical tool, it is
necessary to examine the basic assumptions underlying the OLS model. These
fundamental suppositions do much to clarify and more fully depict the principle of
regression analysis and to describe the conceptual framework upon which regression
analysis is built. So although we have already made at least veiled reference to some
of these principles, a much more detailed examination is called for.

You are again reminded that Y_i is an actual value for Y that has occurred in the
past and is represented by a data point in the scatter diagram somewhere above or
below the regression line. Additionally, Ŷ is a value for Y predicted to occur on the
basis of our regression model, and it is represented by a point on the line. It is the
difference between these two values, Y_i - Ŷ, which constitutes our error.

With this in mind, recall that the OLS procedure produces a model in which the
sum of the errors is zero. That is, given any value for X, the actual data point Y_i will
sometimes be above the regression line (Y_i > Ŷ). This causes us to underestimate the
value for the dependent variable, and the error term Y_i - Ŷ will be positive. At other
times Y_i will be below our regression line and we will overestimate Y, producing a
negative error. Therefore, the errors cancel out and will average to zero. Thus,

\[ \sum (Y_i - \hat{Y}) = 0 \]

Furthermore, OLS will minimize the sum of the squared errors. This is the whole idea
behind OLS. If we square each error to remove the negative sign and sum these
squared errors, the resulting number will be smaller than the number we would get
with any other line. We can therefore say

\[ \sum (Y_i - \hat{Y})^2 = \min \]
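
Both properties can be checked numerically. The sketch below reuses the data
from Quick Check 13.5.1 as a convenient small sample; the function name `sse`
and the sizes of the nudges applied to the coefficients are illustrative
choices, not part of the text.

```python
def sse(b0, b1, x, y):
    """Sum of squared errors for the line Y-hat = b0 + b1*X."""
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))

# Data from Quick Check 13.5.1.
y = [9, 10, 8.2, 9.5, 8, 7.2, 7]
x = [8, 12, 10, 11, 8, 7, 6.5]
n = len(x)

# Least-squares slope and intercept.
ss_x = sum(v * v for v in x) - sum(x) ** 2 / n
ss_xy = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n
b1 = ss_xy / ss_x
b0 = sum(y) / n - b1 * sum(x) / n

errors = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
print(f"sum of errors: {sum(errors):.10f}")          # zero up to rounding
print(f"SSE at the OLS line: {sse(b0, b1, x, y):.4f}")

# Nudging either coefficient away from its OLS value raises the SSE.
for db0, db1 in [(0.1, 0), (-0.1, 0), (0, 0.05), (0, -0.05)]:
    print(db0, db1, round(sse(b0 + db0, b1 + db1, x, y), 4))
```

Every nudged line prints a larger sum of squared errors than the OLS line,
which is what "least squares" means.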
In addition, there are certain assumptions upon which the OLS model is built. 


Assumption 1 The error term is a random variable and is normally distributed.

As we noted earlier, the repeated occurrence of any value for X (say X = 10) will
be associated with many different values for Y_i. These values are normally distributed
around the regression line. Since our error is the difference between Y_i and Ŷ, the error
itself is normally distributed. This was illustrated in Figures 13-8 and 13-9.


Assumption 2 Any two errors are independent of each other.

OLS further assumes that the error we experience when X = 10 is totally
independent of the error suffered when X is equal to any other value.

Unfortunately, this assumption is often violated when using time-series data.
Time-series data consist of the collection of a data series over time. If we were to
select the prime interest rate as our data series and collect data on it for the past 12
months, we would be working with time-series data since its values over time (one
year) were specified.

Why does the use of time-series data often result in the violation of Assumption
2? Many time series move in cyclical fashion. They are abnormally high for a period
of time, then drop to levels somewhat below their mean. If we try to forecast such a
variable, we are likely to overestimate it when it is abnormally low and underestimate
it when it is abnormally high.

This is shown in Figure 13-10. If the variable is above its mean level, we are
likely to underestimate it in our forecast attempt. Then our error Y_i - Ŷ will be
positive, as for January in Figure 13-10.

[Figure 13-10: Dependency of Errors: Autocorrelation.]

Since the variable follows a cyclical pattern, it will probably still be abnormally
high in February. This results in another underestimation and another positive error.
That is, given one positive error, there is a greater than 50 percent probability that the
next error will also be positive. Thus, the errors are not independent and Assumption 2
is violated.

This continues until the variable drops below its average or typical level, such
as in the month of May. At this point there is a tendency to repeatedly overestimate the
value of Y, thereby resulting in a series of negative errors. Again, the errors are not
independent. A negative error was likely to be followed by another negative error.

An inspection of the error terms can be quite revealing. Any detectable pattern,
such as several positive errors followed by several negative, or alternating
positive-negative errors, is a sign the errors are not independent of each other. To the
extent that this occurs and Assumption 2 is violated, the regression model is said to
suffer from autocorrelation. Autocorrelation, also called serial correlation, occurs if
errors are not independent. In the presence of autocorrelation, our regression model is
less reliable.
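
One informal way to inspect the errors for such a pattern is to count how often
an error has the same sign as the one before it. The sketch below does this for
a set of hypothetical monthly errors that were made up purely for illustration;
it is a rough diagnostic under that assumption, not a formal test.

```python
# HYPOTHETICAL monthly forecast errors (Y_i - Y-hat), invented for illustration:
# runs of positive errors followed by runs of negative ones.
errors = [1.2, 0.9, 1.4, 0.6, -0.8, -1.1, -0.5, -1.3, 0.7, 1.0, 0.4, -0.9]

# Pair each error with the one that follows it.
pairs = list(zip(errors, errors[1:]))

# A positive product means the two errors share the same sign.
same_sign = sum(1 for prev, curr in pairs if prev * curr > 0)
share = same_sign / len(pairs)

print(f"{same_sign} of {len(pairs)} consecutive pairs share a sign ({share:.0%})")
```

With independent errors the share would hover around one half. A share well
above that, as in this made-up series, is the kind of pattern that points to
autocorrelation.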
Assumption 3 All errors have the same variance.

We specified earlier that the errors are normally dispersed above and below the
regression line. The OLS procedure assumes that the variance of these Y-values where
X = 10 is the same as it is where X is equal to any other value. Notice from Figure
13-11 that the variance in the error terms is the same at all three income levels, I1, I2,
and I3. In all three cases, the dispersion in the Y-values is the same, as seen by the
shapes of the normal curves. This assumed condition of equal variance in errors is
known as homoscedasticity.

[Figure 13-11: Equality of Variance in the Error Term. Horizontal axis: Income.]

This third assumption is, however, often violated when using cross-sectional data.
Cross-sectional data would consist, for example, of observations for various levels of
consumption by residents in Kansas City in the month of July. Since the data were
collected at a single point in time (July), they do not constitute time-series data.
Moreover, given that the data set includes a large group of people and thus cuts across
different sections of the economic strata, encompassing lower-, middle-, and
upper-class income groups, they are considered cross-sectional data.

Traditional economic theory holds that wealthy individuals exhibit a different
behavior pattern in their consumption expenditures than do the less fortunate. This
economic principle can be used to explain how Assumption 3 can be violated.

Specifically, the variation in the expenditures among the wealthy exceeds that of
people in lower income brackets. Thus, the variance in consumption will increase as
income goes up. This is shown in Figure 13-12. The variance in consumption patterns
becomes greater at higher income levels, where I1 < I2 < I3. Notice that at the highest
income level, I3, the spread in the errors above and below the regression line is greater
than at the lowest, I1. When Assumption 3 is violated, the model is said to suffer from
heteroscedasticity. To the extent that heteroscedasticity occurs, the regression model is
less reliable. The presence of heteroscedasticity will often introduce an element of bias
into the model, thereby casting doubt on the values of b0 and b1.
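
A simple way to see whether the spread of the errors changes with income is to
compare the variance of the errors within each income group. The numbers below
are hypothetical and were invented only to illustrate the comparison; the
group labels I1, I2, and I3 follow the income levels used in the discussion.

```python
from statistics import pvariance

# HYPOTHETICAL regression errors grouped by income level (made-up numbers).
errors_by_income = {
    "I1 (low)":    [-0.5, 0.3, -0.2, 0.4, 0.1, -0.1],
    "I2 (middle)": [-1.1, 0.9, -0.6, 1.2, 0.2, -0.6],
    "I3 (high)":   [-2.4, 2.1, -1.5, 2.6, 0.4, -1.2],
}

# Under homoscedasticity these variances would be roughly equal.
variances = {level: pvariance(errs) for level, errs in errors_by_income.items()}

for level, var in variances.items():
    print(f"{level}: variance of errors = {var:.2f}")
```

In this invented sample the variance climbs steadily from the low to the high
income group, which is the pattern described as heteroscedasticity.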


[Figure 13-12: Heteroscedasticity in Error Terms.]


Assumption 4 The means of the Y-values all lie on a straight line.

Given some value X_i, there will occur a normal distribution of Y-values. This
distribution of Y-values has a mean. The same is true if X is set equal to any other
value. OLS assumes that these two means, as well as all others that might be observed,
lie on a straight line. This is referred to as the assumption of linearity, and can be
expressed as

\[ \mu_{Y|X} = \beta_0 + \beta_1 X \]

where μ_{Y|X} is the mean of the population of Y-values for any given value of X.

These assumptions form the basis for regression analysis. It is upon them that the
foundation of OLS is built. A complete comprehension of regression analysis is not
possible without an understanding of these assumptions. However, these conditions
represent the ideal situation. It is unlikely all of them are ever fully valid in a single
regression. We will examine ways to test for the validity of these conditions and learn
corrective steps to take if a problem is indicated.


THE STANDARD ERROR OF THE ESTIMATE:
A MEASURE OF GOODNESS-OF-FIT

The regression line, as we have already noted, is often called the line of best fit. It fits,
or depicts, the relationship between X and Y better than any other line. However, just
because it provides the best fit, there is no guarantee that it is any good. We would like
to be able to measure just how good our best fit is.

Actually, there are at least two such measures of goodness-of-fit: (1) the standard
error of the estimate, and (2) the coefficient of determination. We will defer discussion
of the latter concept until we examine correlation analysis later in the chapter. We
embark on a description of the standard error of the estimate at this point.

The standard error of the estimate, Se, is a measure of the average amount by
which the actual observations for Y vary around the regression line. It gauges the
variation of the data points above and below the regression line. As a measure of the
dispersion of the Y-values around the regression line, it reflects our tendency to depart
from the actual value of Y when using our regression model for prediction purposes. In
that sense, it is a measure of the average amount of our error.

If all the data points fell on a perfectly straight line as in Figure 13-13(a), our
regression line would pass through each one. In this rather fortunate case we would
suffer no error in our forecasts, and the standard error of the estimate would be zero.
However, data are seldom that cooperative. There is going to be some scatter in the
data, as in Figure 13-13(b). The standard error of the estimate measures this average
variation of the data points around the regression line we use to estimate Y and thus
provides a measure of the error we will suffer in that estimation. Formula (13.13)
illustrates this principle. Notice that the numerator reflects the difference between the
actual values of Y, Y_i, and our estimate Ŷ.

\[ S_e = \sqrt{\frac{\sum (Y_i - \hat{Y})^2}{n - 2}} \tag{13.13} \]


Unfortunately, Formula (13.13) is computationally inconvenient. It is necessary to
develop an easier method of hand calculation. Recall that σ² is the variance of the
regression errors. One of the basic assumptions of the OLS model is that this variance
in the errors around the regression line is the same for all values of X. The smaller the
value for σ², the less is the dispersion of the data points around the line.

[Figure 13-13: Possible Scatter Diagrams.]

Since σ² is a parameter, it will likely remain unknown, and it is necessary to
estimate its value with our sample data. An unbiased estimate of σ² is the mean square
error (MSE). In our previous chapter on ANOVA, we learned that the MSE is the error
sum of squares (SSE) divided by the degrees of freedom. In the context of regression
analysis, SSE is

\[ SSE = SS_y - \frac{(SS_{xy})^2}{SS_x} \tag{13.14} \]

In a simple regression model, two constraints are placed on our data set since we
must estimate two parameters, β0 and β1. There are, therefore, n - 2 degrees of
freedom. MSE is

\[ MSE = \frac{SSE}{n - 2} \tag{13.15} \]

The standard error of the estimate is then

\[ S_e = \sqrt{MSE} \tag{13.16} \]


Example 13.5 demonstrates the calculation of Se using our data set for Hop
Scotch.

EXAMPLE 13.5 The Standard Error of the Estimate for Hop Scotch Airlines

After the accounting department for Hop Scotch calculates the regression model
(Example 13.1), the marketing division questions how accurate the model is in
forecasting numbers of passengers. The marketing division needs a precise estimate of
passengers in order to compare how competitive they are with the rest of the industry.
The head of the marketing division feels that the standard error of the estimate
will serve nicely as a measure of the closeness or precision of their estimate of the
numbers of passengers which they obtained by using the regression model.

SOLUTION: Using the data in Table 13-2, we have

\[ SSE = SS_y - \frac{(SS_{xy})^2}{SS_x} = 171.73333 - \frac{(148.9333)^2}{137.7333} = 10.6893 \]

\[ MSE = \frac{SSE}{n - 2} = \frac{10.6893}{15 - 2} = 0.82226 \]

\[ S_e = \sqrt{MSE} = \sqrt{0.82226} = 0.90678 \approx 0.907 \]

For the statistical interpretation of Se, continue reading.
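
The three steps of the calculation (SSE, then MSE, then Se) can be reproduced
with a few lines of Python. The sums of squares are the Hop Scotch values used
in this example; the variable names are illustrative.

```python
from math import sqrt

# Sums of squares and cross-products for Hop Scotch (Table 13-2), n = 15 months.
n = 15
ss_x, ss_y, ss_xy = 137.73333, 171.73333, 148.9333

sse = ss_y - ss_xy ** 2 / ss_x      # error sum of squares
mse = sse / (n - 2)                 # two parameters estimated, so n - 2 d.f.
se = sqrt(mse)                      # standard error of the estimate

print(f"SSE = {sse:.4f}, MSE = {mse:.5f}, Se = {se:.5f}")
```

Dividing by n - 2 rather than n reflects the two degrees of freedom used up in
estimating the intercept and the slope.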


The standard error of the estimate is quite similar to the standard deviation of a
single variable that we examined in Chapter 3. If we were to collect data on the
incomes for n = 100 people, we could easily calculate the standard deviation. This
would provide us with a measure of dispersion of the income data around their mean.

In regression analysis we have two variables, X and Y. The standard error of the
estimate is thus a measure of the dispersion of the Y-values around their mean, given
any specific X-value.

Since the standard error of the estimate is similar to the standard deviation for a
single variable, it can be interpreted similarly. Recall that the Empirical Rule states if
the data are normally distributed an interval of one standard deviation above the mean
and one standard deviation below the mean will encompass 68.3 percent of all the
observations; an interval of two standard deviations on each side of the mean contains
95.5 percent of the observations; and three standard deviations on each side of the
mean encompass 99.7 percent of the observations.

The same can be said for the standard error of the estimate. In our present
example, where X = 10,

\[ \hat{Y} = 4.4 + 1.08(10) = 15.2 \]

Remember, this value of 15.2 is the estimate of the mean value we would get for Y if
we set X equal to 10 many times. To illustrate the meaning of the standard error of the
estimate, locate the points that are one Se (that is, 0.907) above and below the mean
value of 15.2. These points are 14.29 (15.2 - 0.907) and 16.11 (15.2 + 0.907). If we
were to draw lines through each point parallel to the regression line as in Figure
13-14, approximately 68.3 percent of the data points will fall within these lines. The
remaining 31.7 percent of the observations will be outside this interval. In our case,
68.3 percent of the times when $10,000 is spent on advertising, the number of
passengers will be between 14,290 and 16,110. The remaining 31.7 percent of the
time, the number of passengers will exceed 16,110 or be less than 14,290.
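
The same interval, along with the wider two- and three-Se intervals from the
Empirical Rule, can be computed with a short Python sketch. The function name
`band` is an illustrative choice; the regression coefficients and Se are the
Hop Scotch values, and the coverage percentages hold only under the normality
assumption.

```python
b0, b1 = 4.4, 1.08     # Hop Scotch regression line
se = 0.907             # standard error of the estimate for Hop Scotch

def band(x, k):
    """Interval k standard errors on either side of the predicted mean at x."""
    y_hat = b0 + b1 * x
    return y_hat - k * se, y_hat + k * se

# Approximate coverage from the Empirical Rule, assuming normal errors.
for k, coverage in [(1, "68.3%"), (2, "95.5%"), (3, "99.7%")]:
    low, high = band(10, k)
    print(f"{coverage} of months: {low:.2f} to {high:.2f} thousand passengers")
```

The first line of output reproduces the 14.29 to 16.11 interval; each further
standard error widens the band by another 0.907 on both sides.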

Given our interpretation of Se, it follows that the more dispersed the original data
are, the larger Se will be. As indicated by the scatter diagrams in Figure 13-15, the
data for Figure 13-15(a) are much more dispersed than those in Figure 13-15(b). The
Se for Figure 13-15(a) would therefore be larger. After all, if you are to encompass
68.3 percent of the observations within one Se of the regression line, the interval must
be wider if the data are more spread out.

[Figure 13-14: Standard Error of the Estimate.]

[Figure 13-15: A Comparison of the Standard Error of the Estimate.]


13.7.1 Given the values for Y and X, calculate and interpret the standard error of the
estimate.

Y = 5, 6.3, 8, 7.5, 9, 8.1, 8, 8.5, 10
X = 2.6, 3.9, 5, 4.9, 5.2, 5, 5.5, 5, 6

Answer: The model is Ŷ = 1.1386 + 1.39565X; Se = 0.50543. (Your answer may
differ slightly due to rounding.) It states that given any value for X, 68.3 percent of
the observations for Y will fall within 0.50543 units of 1.1386 + 1.39565X.

13.7.2 Using the values for X and Y below, determine within what interval 95.5 percent of
the observations for Y will fall if X = 4.

Y = 5, 4.5, 5.5, 6.2, 7.6, 5.5, 7
X = 1.9, 1.5, 3, 3.8, 4.5, 4, 4

Answer: Ŷ = 3.211 + 0.829X; Se = 0.6042; the interval is 5.319 to 7.735.

CORRELATION ANALYSIS

Our regression model has given us a clear perception of the relationship between
advertising expenditures by Hop Scotch Airlines and the number of courageous
travelers who queue up at the Hop Scotch ticket counter. It is now evident that since
the regression coefficient b1 is positive, there is a direct relationship between
advertising dollars and passengers. This means that as more money flows into the
advertising budget, more ticket buyers select Hop Scotch. Or, conversely, a decrease in
advertising expenditures is accompanied by a reduction in passenger traffic. The two
variables move together. To be more precise, since b1 = 1.08, for every one-unit
increase in advertising, the number of passengers increases by 1.08 units. That is, for
every additional $1,000 in the advertising budget, 1,080 more passengers buy Hop
Scotch tickets. However, you are again reminded that our findings do not allow us to
conclude that advertising causes ticket sales to rise. Even though there may appear to
be a cause-and-effect relationship, we can only conclude that advertising and
passengers are correlated or related in some given manner. The true cause of this
relationship may lie in some third variable that influences both advertising and ticket
sales.

Now that we have a general understanding of the basic nature of the relationship
between advertising and number of passengers, it is beneficial to measure the strength
of that relationship. This is the job of correlation analysis. This measure of strength is
provided by the coefficient of determination. The coefficient of determination is one
of the measures of goodness-of-fit we mentioned earlier along with the standard error
of the estimate Se.


The Coefficient of Determination

To understand correlation analysis, we must first consider the total deviation of Y.
This important concept is the amount by which the individual Y-values vary from their
mean Ȳ; that is, Y_i - Ȳ. For Hop Scotch,

\[ \bar{Y} = \frac{\sum Y_i}{n} = \frac{268}{15} = 17.87 \]

so the total deviation for the thirteenth month, when Y_i = 23, is 23 - 17.87 = 5.13.
This is shown in Figure 13-16. The value for Y_i of 23 lies 5.13 above the horizontal
line representing Ȳ of 17.87. This total deviation between Y_i and Ȳ can be broken
down into two types. The explained deviation is that portion of the total deviation that
is explained by our model. It is the difference between what our model predicts, Ŷ, and
the mean value for Y, that is, Ŷ - Ȳ. In this manner, the explained deviation measures
the amount of the total difference between Y_i and Ȳ that is explained by the regression
model. Since X = 16 in the thirteenth month, Ŷ = 4.4 + 1.08(16) = 21.68. The
explained deviation is therefore

\[ \hat{Y} - \bar{Y} = 21.68 - 17.87 = 3.81 \]

[Figure 13-16: the regression line Ŷ = 4.4 + 1.08X and the horizontal line at Ȳ = 17.87,
with the unexplained deviation (Y_i - Ŷ) = 1.32 marked for the thirteenth month.]

Of that total deviation of 5.13 for the thirteenth month, our model explains 3.81.
The rest of the total deviation remains unexplained. The unexplained deviation is
that portion of the total deviation of Y_i from Ȳ not explained by our model. It is the
additional deviation from Ȳ over and above what our model is able to account for. It is
found as the difference between what Y actually was (Y_i) and what our model
predicted (Ŷ), that is, Y_i - Ŷ. You will recognize this value as our error.


Notice that had we tried to predict Y merely on the basis of Ȳ, our error would
have been Y_i - Ȳ = 23 - 17.87 = 5.13. Our regression model, on the other hand,
forecasts a value for Y of

\[ \hat{Y} = 4.4 + 1.08(16) = 21.68 \]

Using our regression model, our error is only Y_i - Ŷ = 23 - 21.68 = 1.32. We are
closer to the actual value for passengers when we use our model than we would be if
we just took the average value for Y as our prediction. Our model does have some
value as an explanatory tool.
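
The split of the thirteenth month's deviation into its two parts can be laid
out in a few lines of Python. The variable names are illustrative; the numbers
are the Hop Scotch values quoted in this discussion.

```python
b0, b1 = 4.4, 1.08        # Hop Scotch regression line
y_bar = 17.87             # mean number of passengers (thousands)
x13, y13 = 16, 23         # advertising and passengers in the thirteenth month

y_hat = b0 + b1 * x13          # what the model predicts
total = y13 - y_bar            # total deviation
explained = y_hat - y_bar      # portion explained by the model
unexplained = y13 - y_hat      # portion left over (the error)

print(f"total = {total:.2f}, explained = {explained:.2f}, "
      f"unexplained = {unexplained:.2f}")
```

The explained and unexplained parts add back up to the total deviation for that
month.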
Just how much value it has is measured by the coefficient of determination. An
examination of this important statistical concept requires that we calculate the sums of
squares. In that regard, notice that the

Total deviation = Explained deviation + Unexplained deviation

That is,

\[ (Y_i - \bar{Y}) = (\hat{Y} - \bar{Y}) + (Y_i - \hat{Y}) \tag{13.17} \]

If we were to find each of these three types of deviations for all 15 observations
(months), square them, and sum each of them, we would have the sums of squares.
Thus, the total sum of squares (or sum of the squares of the total deviation, SST) is

\[ SST = \sum (Y_i - \bar{Y})^2 \tag{13.18} \]

The regression sum of squares (or sum of the squares of the regression, SSR) is

\[ SSR = \sum (\hat{Y} - \bar{Y})^2 \tag{13.19} \]

This regression sum of squares is also called the explained deviation, since it is the
regression model that is doing the explaining.

The error sum of squares (or sum of the squared errors, SSE) is

\[ SSE = \sum (Y_i - \hat{Y})^2 \tag{13.20} \]

The error sum of squares is also called the unexplained deviation since that portion of
the total deviation left unexplained is our error.

The squaring process is necessary to prevent negative errors from offsetting the
positive errors and leaving us with nothing but zeros to work with.
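
For a least-squares line with an intercept, the squared deviations add up in the
same way the individual deviations do: SST = SSR + SSE. The Python sketch below
checks this on the small data set from Quick Check 13.5.1, which is reused here
only as a convenient sample. The variable names are illustrative.

```python
# Data from Quick Check 13.5.1, reused as a small illustrative sample.
y = [9, 10, 8.2, 9.5, 8, 7.2, 7]
x = [8, 12, 10, 11, 8, 7, 6.5]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

# Least-squares line.
ss_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
ss_x = sum((a - x_bar) ** 2 for a in x)
b1 = ss_xy / ss_x
b0 = y_bar - b1 * x_bar
y_hat = [b0 + b1 * a for a in x]

sst = sum((b - y_bar) ** 2 for b in y)                 # total
ssr = sum((f - y_bar) ** 2 for f in y_hat)             # explained
sse = sum((b - f) ** 2 for b, f in zip(y, y_hat))      # unexplained

print(f"SST = {sst:.4f}, SSR = {ssr:.4f}, SSE = {sse:.4f}")
print(f"SSR + SSE = {ssr + sse:.4f}")
```

The last line of output matches SST, so the ratio SSR/SST can be read as the
share of the total variation that the model accounts for.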

Now recall what we set out to do. We want to measure how closely our line fits
the data. We want a measure of how well our model explains changes in the dependent
variable. Stop and think. How could we use SST, SSR, and SSE to measure the
explanatory power of our model? That is, how could we measure to what extent our
model explains changes in Y? Keep in mind that SST is the total amount of the
deviation in Y that needs to be explained, and SSR is the amount of that deviation that
is explained by our model. Couldn't we then measure the explanatory power of our
model with a ratio of the portion of the variation that is explained by our model (as
measured by SSR) to the total variation in Y (as measured by SST)? This is exactly
what the coefficient of determination does! It is a ratio of the explained deviation to
the total deviation.

The coefficient of determination, r², measures that portion of the total deviation
in Y that is explained by our model. In this sense, it is a measure of the explanatory
power of the regression model.

\[ r^2 = \frac{\text{Explained variation}}{\text{Total variation}} = \frac{SSR}{SST} \tag{13.21} \]

In terms of our sums of squares and cross-products, it can be calculated as

\[ r^2 = \frac{(SS_{xy})^2}{(SS_x)(SS_y)} \]

The value for r² must be between 0 and 1 since more than 100 percent of the change in
Y cannot be explained.

If r² = 70 percent, this means 70 percent of the variation in Y is explained by
changes in X. Of course, the higher the r², the more explanatory power our model has.
In this manner, r² measures the strength of the linear relationship between X and Y.
Note that r² has meaning only for linear relationships. Two variables may exhibit a
coefficient of determination of zero and still be related in a nonlinear manner.
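
A small constructed example makes the last point concrete. In the Python sketch
below, the data are hypothetical: Y is set equal to X squared, so Y is fully
determined by X, yet the coefficient of determination computed from the sums of
squares and cross-products is zero.

```python
# HYPOTHETICAL data: Y is completely determined by X, but through a curve.
x = [-3, -2, -1, 0, 1, 2, 3]
y = [v ** 2 for v in x]
n = len(x)

# Sums of squares and cross-products.
ss_x = sum(v * v for v in x) - sum(x) ** 2 / n
ss_y = sum(v * v for v in y) - sum(y) ** 2 / n
ss_xy = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n

r_squared = ss_xy ** 2 / (ss_x * ss_y)
print(f"r-squared = {r_squared:.3f}")
```

The cross-product term vanishes because the positive and negative X-values
offset each other, so a straight line explains none of the variation even
though the relationship is exact.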


EXAMPLE

The marketing division for Hop Scotch has completed the regression analysis. This
gave them a good idea as to the nature of the relationship between advertising and
passengers. However, the marketing director now wants to know how strong that
relationship is. He is concerned with how much reliance he can place on that
relationship in the decision-making process. This information can be derived through
an examination of the coefficient of determination.

SOLUTION: Given the data for Hop Scotch from Table 13-2, we have

\[ r^2 = \frac{(SS_{xy})^2}{(SS_x)(SS_y)} = \frac{(148.9333)^2}{(137.73333)(171.73333)} = 0.93776 \approx 0.94 \]

Notice that all the values were carried several decimal places due to the
aforementioned sensitivity of r² to rounding.

INTERPRETATION: The coefficient of determination reveals that 94 percent of the change
in the number of passengers is explained (not caused) by changes in advertising
expenditures.

As you were warned earlier, we cannot conclude that a change in X causes a
change in Y. Correlation does not imply causation. It may be that a third factor,
unknown to us, is causing X and Y to behave as they do. We only know that
advertising and passengers are correlated.

Since r² = 0.94, our model explains 94 percent of the change in Y. The other 6
percent can be explained by some variable(s) other than advertising. This 6 percent is
sometimes referred to as the coefficient of nondetermination, K.


The Coefficient of Correlation


In many instances we have need to calculate the coefficient of correlation (or merely
the correlation coefficient). Developed by Karl Pearson around the turn of the century,
it is sometimes called the Pearsonian product-moment correlation coefficient.
Designated as r, the correlation coefficient is simply the square root of the coefficient
of determination.


The value for r ranges between +1 and −1. After calculating r as the square root of r²,
you must give it the same algebraic sign as the regression coefficient, b₁. Since r
always carries the same sign as the regression coefficient, it will reflect the slope of
the regression line. If r > 0, b₁ will be positive and the line will slope up. If r < 0,
b₁ will be negative and the regression line will be negatively sloped.

The absolute value of r indicates the strength of the relationship between X and Y,
while the sign tells us whether they are related in a direct or inverse fashion. Figure
13-17 illustrates possible values for r. An r-value of +1 indicates perfect positive
correlation, while r = −1 suggests perfect negative correlation. If r = 0, no linear
relationship between X and Y is suggested.

Refer back to Display 13-1 on page 632, which provided a portion of the SPSS-PC
printout for Hop Scotch Airlines. The correlation coefficient and the coefficient of
determination are shown as Multiple R = 0.96838 and R Square = 0.93776. Note
also that the standard error discussed above, 0.90678, is displayed. The Adjusted
R Square value is of little or no consequence in simple regression; we will discuss it in
the next chapter when we deal with multiple regression analysis.
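
The printout values for Multiple R and R Square can be reproduced from the sums of squares used in this chapter's Hop Scotch calculations. The following is a minimal Python sketch, not part of the original text. It assumes the computational identity r² = (SSxy)²/(SSx · SSy), which is algebraically the same as SSR/SST, and the function name `correlation_from_sums` is purely illustrative.

```python
import math

# Hop Scotch summary sums quoted in the chapter's worked examples
SS_X = 137.73333   # sum of squares of X (advertising)
SS_Y = 171.73333   # sum of squares of Y (passengers)
SS_XY = 148.93333  # sum of cross products of X and Y


def correlation_from_sums(ss_x, ss_y, ss_xy):
    """Return (b1, r_squared, r) for a simple regression.

    Assumes r^2 = SSxy^2 / (SSx * SSy); r is given the sign of b1.
    """
    b1 = ss_xy / ss_x                            # slope of the regression line
    r_squared = ss_xy ** 2 / (ss_x * ss_y)       # coefficient of determination
    r = math.copysign(math.sqrt(r_squared), b1)  # r carries the sign of b1
    return b1, r_squared, r


b1, r_squared, r = correlation_from_sums(SS_X, SS_Y, SS_XY)
print(f"b1 = {b1:.5f}, r^2 = {r_squared:.5f}, r = {r:.5f}")
```

Because the slope here is positive, r is reported as a positive number; a negative SSxy would make both b₁ and r negative.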


13.8.1 A study involving six employees finds the following values for job performance
rating (JPR) and years of experience. Using these data, how well does experience
explain job performance?


JPR = 1.5, 4, 4.5, 5, 6.2, 7


Experience = 2.5, 3.5, 4, 4.5, 5, 5.9 years
Answer: r² = 0.964
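
One way to check this answer is to compute the sums of squares directly from the six observations. The sketch below is an illustration rather than the text's own method; the helper name `sums_of_squares` is hypothetical, and it uses the deviation form of SSx, SSy, and SSxy together with the identity r² = (SSxy)²/(SSx · SSy).

```python
def sums_of_squares(x, y):
    """Return (SSx, SSy, SSxy) computed from deviations about the means."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    ss_x = sum((xi - mean_x) ** 2 for xi in x)
    ss_y = sum((yi - mean_y) ** 2 for yi in y)
    ss_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return ss_x, ss_y, ss_xy


# Data from Quick Check 13.8.1
experience = [2.5, 3.5, 4, 4.5, 5, 5.9]  # independent variable, years
jpr = [1.5, 4, 4.5, 5, 6.2, 7]           # dependent variable, rating

ss_x, ss_y, ss_xy = sums_of_squares(experience, jpr)
r_squared = ss_xy ** 2 / (ss_x * ss_y)   # coefficient of determination
print(f"r^2 = {r_squared:.3f}")
```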


13.8.2 A second study related the commuting time, in hours, for shoppers to reach the
shopping center to the duration of the visit, in hours. Given the values shown:
a. Which is the dependent variable and which is the independent?
b. Calculate and interpret the coefficient of determination.


Commuting time: 0.5, 0,2, 0.2, 1.5, 2, 21,2, 15 
Figure 13-17  Potential Values for r (panels: perfect positive correlation, perfect
negative correlation, no linear correlation)

Length of visit: 1.2, 5, 5.5, 4, 3.3, 4.2, 
Answer:
a. It would seem that the duration of the visit would depend on the commuting
time required to reach the shopping center. If the time was long, the shopper
would want to conduct as much business there as possible.

b. r² = 0.00059. Less than 1 percent of a change in length of visit is explained by
a change in commuting time.


1,5. 


LIMITATIONS OF REGRESSION ANALYSIS 


Although regression and correlation analysis often prove extremely useful in decision
making for a wide variety of business and economic matters, there are certain
limitations to their application and interpretation. As already noted, regression and
correlation cannot determine cause-and-effect relationships. Correlation does not
imply causation. This point was dramatically made by a British statistician who
"proved" that storks bring babies. He collected data on birthrates and the number of
storks in London and found a very high correlation, something like r = 0.92. He
therefore concluded that the fairy tale about storks and babies was true.
However, as you may have already suspected, that's not really the way it works. It
seems that this brand of stork liked to nest in the tops of Londoners' chimneys.
Therefore, where population was dense and the birthrate was high, there were many
chimneys to attract this fowl; thus, the high correlation between birthrates and storks.
Actually, both storks and births were caused by a third factor, population density,
which the researcher conveniently ignored. Remember, correlation does not mean
causation.
Additionally, you must be careful not to use your regression model to predict Y on
the basis of values for X outside the range of your original data set. Notice that the
values for X in the Hop Scotch data set range from a low of 8 to a high of 19. We have
isolated the relationship between X and Y only for that range of X-values. We have no
idea what the relationship is outside that range. For all we know, it might appear as
shown in Figure 13-18. As you can see, for values outside our range of 8 to 19, the
X-Y relationship is entirely different from what we might expect given our sample.


Figure 13-18  A Possible X-Y Relationship


Another failing of regression and correlation analysis becomes apparent when two
obviously unrelated variables seem to exhibit some relationship. Assume that for some
strange reason you wish to examine the correlation between the number of elephants
born in the Kansas City Zoo and the tonnage of sea trout caught by sports fishermen in
the Gulf off Tallahassee, Florida. Lo and behold, you find an r = 0.91. Would you
conclude that there is a relationship? Such a conclusion is obviously bizarre. Despite
the r-value, pure logic tells us that there is not really any relationship between these
two variables. You have merely uncovered spurious correlation, which is correlation
that occurs just by chance. There is no substitute for common sense in regression and
correlation analysis.


One of the basic purposes in conducting regression analysis is to forecast and predict
values for the dependent variable. As we have seen, once the regression equation has
been determined, it is a very simple matter to develop a point estimate for the
dependent variable by substituting a given value for X into the equation and solving
for Y.
In addition, the researcher may be interested in interval estimates. We have
already seen that they are often preferable to mere point estimates. There are at least
two such interval estimates commonly associated with regression procedures.

The first one is an interval estimate for the mean value of Y given any X-value.
That is, we may want to estimate the population mean for all Y-values (not just the
n = 15 in our sample) when X is equal to some given value. We may be interested in
the average number of passengers in all months in which we spend $10,000 on
advertising (i.e., X = 10). This is called the conditional mean.

A second important confidence interval seeks to estimate a single value of Y given
that X is set equal to a specific amount. This estimate is referred to as a predictive
interval. Thus, while the conditional mean is an estimate of the average value of Y in
all months in which X is equal to a specified amount, the predictive interval estimates
Y in any single month in which X is set equal to a given amount.


The Conditional Mean for Y


Suppose we wanted to develop an interval estimate for the conditional mean of Y,
μ_{Y|X}. This is the population mean for all Y-values under the condition that X is equal to
a specific value. Recall that if we let X equal some given amount (say X = 10) many
times, we will get many different values of Y. The interval we are calculating here is
an estimate of the mean of all of those many Y-values. That is, it is an interval estimate
for the mean value of Y on the condition that X is set equal to 10 many times.

Actually, the confidence interval for the conditional mean value of Y has two
possible interpretations, just as did those confidence intervals we constructed back in
Chapter 8. Assume that we are calculating, for example, a 95 percent confidence
interval.


FIRST INTERPRETATION  As noted above, if we let X equal the same amount
many times we will get many different Y-values. We can then be 95 percent
confident that the mean of those Y-values (μ_{Y|X}) will fall within the specified
interval.


SECOND INTERPRETATION  If we were to take many different samples of X and Y
values and construct confidence intervals based on each sample, 95 percent of
them would contain μ_{Y|X}, the true but unknown mean value of Y given X =
10.


To calculate this interval for the conditional mean value of Y, we must first
determine S_Y, the standard error of the conditional mean. The standard error of the
conditional mean recognizes that we use a sample to calculate b₀ and b₁ in the
regression equation. Thus, b₀ and b₁ are subject to sampling error. If we were to take a
different set of n = 15 months and determine a regression equation, we would likely
get different values for b₀ and b₁. The purpose of S_Y is to account for the different
values for b₀ and b₁ resulting from sampling error. It is determined by

$$S_Y = S_e \sqrt{\frac{1}{n} + \frac{(X_i - \bar{X})^2}{SS_x}} \qquad [13.24]$$

where S_e is the standard error of the estimate
      X_i is the given value for the independent variable

The confidence interval for the conditional mean is then

$$\text{C.I. for } \mu_{Y|X} = \hat{Y} \pm t\,S_Y \qquad [13.25]$$

in which Ŷ is the point estimator found from our original regression equation, and the
t-value is based on a selected level of confidence with n − 2 degrees of freedom.
There are n − 2 degrees of freedom because we must calculate two values, b₀ and b₁,
from the sample data. We therefore lose two degrees of freedom.


EXAMPLE 


Since Hop Scotch seems to spend … requests that the quantitative analysis section of
the marketing division develop a 95 percent confidence interval for μ_{Y|X} on the
condition that X = 10. The analyst will estimate the true mean for Y if X = 10.


SOLUTION: Formula (13.24) gives

$$S_Y = S_e \sqrt{\frac{1}{n} + \frac{(X_i - \bar{X})^2}{SS_x}}$$

The value of S_e was calculated to be 0.907 in Example 13.2, and X has been set at 10.
Then using the data from Table 13-2, we have

$$S_Y = 0.907 \sqrt{\frac{1}{15} + \frac{(10 - 12.47)^2}{137.73333}} = 0.303$$

Since

$$\hat{Y} = b_0 + b_1 X = 4.4 + 1.08(10) = 15.2$$

Formula (13.25) gives

$$\text{C.I. for } \mu_{Y|X} = \hat{Y} \pm t\,S_Y = 15.2 \pm t(0.303)$$

Given a 95 percent confidence level (α = 0.05) and n − 2 = 13 degrees of freedom,
the t-table yields t = 2.160. Then

$$\text{C.I. for } \mu_{Y|X} = 15.2 \pm (2.160)(0.303) = 15.2 \pm 0.65$$

$$14.55 < \mu_{Y|X} < 15.85$$


INTERPRETATION: Hop Scotch can be 95 percent confident that the true population mean
for Y is between 14,550 passengers and 15,850 passengers for all those months in
which they spend $10,000 for advertising purposes.
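
The arithmetic in this example can be sketched in a few lines of Python. This is an illustration only: the constants are the rounded values quoted in the example, the t-value is taken from the table rather than computed, and the function name `conditional_mean_interval` is hypothetical.

```python
import math

# Values quoted in the Hop Scotch example
N = 15              # sample size (months)
X_BAR = 12.47       # mean advertising, in $1,000s
SS_X = 137.73333    # sum of squares of X
S_E = 0.907         # standard error of the estimate
B0, B1 = 4.4, 1.08  # rounded regression coefficients
T_CRIT = 2.160      # t-table value: 95 percent, 13 degrees of freedom


def conditional_mean_interval(x0):
    """Return (lower, upper) limits for the mean of Y when X = x0."""
    s_y = S_E * math.sqrt(1 / N + (x0 - X_BAR) ** 2 / SS_X)  # Formula (13.24)
    y_hat = B0 + B1 * x0                                      # point estimate
    margin = T_CRIT * s_y
    return y_hat - margin, y_hat + margin                     # Formula (13.25)


lower, upper = conditional_mean_interval(10)
print(f"{lower:.2f} < mean of Y given X = 10 < {upper:.2f}")
```

Small differences in the third decimal place relative to the hand calculation are to be expected, since the worked example rounds intermediate results.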


We could calculate the confidence intervals for μ_{Y|X} at several X-values. This
would give us several confidence intervals. These intervals would then form a
confidence band for μ_{Y|X}. Notice in Figure 13-19 the band becomes wider at the
extremes. This happens because regression analysis is based on averages, and the
farther we get away from the center point of X̄ = 12.47, the less accurate our estimate
becomes. Therefore, to retain our 95 percent confidence level, the band must be wider.
It is narrowest at X = X̄ = 12.47. If you were to calculate the 95 percent interval at
X = 8, you would find it to be wider than the one we just got at X = 10.


Figure 13-19  Confidence Limits for μ_{Y|X}
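
The shape of the confidence band can be checked numerically. The sketch below, which is illustrative and not part of the original text, evaluates the half-width t · S_Y from Formula (13.24) at a few X-values; the particular X-values and the function name `half_width` are arbitrary choices.

```python
import math

# Values quoted in the Hop Scotch example
N, X_BAR, SS_X = 15, 12.47, 137.73333
S_E, T_CRIT = 0.907, 2.160  # standard error of the estimate; t for 95 percent, 13 d.f.


def half_width(x0):
    """Half-width of the 95 percent interval for the conditional mean at X = x0."""
    return T_CRIT * S_E * math.sqrt(1 / N + (x0 - X_BAR) ** 2 / SS_X)


# The band is narrowest at the mean of X and widens in both directions
widths = {x0: half_width(x0) for x0 in (8, 10, 12.47, 15, 19)}
for x0, w in widths.items():
    print(f"X = {x0:>5}: plus or minus {w:.3f}")
```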


13.10.1 Given the values for Y and X, calculate a 99 percent confidence interval for the
conditional mean value of Y at X = 9.


Y = 8, 5, 6.5, 8, 7.2


X = 10, 8, 8.5, 8.4, 8
Answer: 3.707 to 10.89


The Predictive Interval for a Single Value of Y


The confidence interval constructed above is for the population mean value of all
Y-values when X is equal to a given amount many times. At other times it might be
useful to construct a confidence interval for a single value of Y that is obtained when X
is set equal to some value only once. Hop Scotch may be interested in predicting the
actual number of customers next month if they spend $10,000 on advertising. This
differs from the problem above in which the concern was with the average value of Y
if X was set equal to 10 many times.

Our interest now focuses on a prediction for a single value of Y if X is set equal to
a given amount only once. That is, instead of trying to predict the mean of many
Y-values obtained on the condition that X is set equal to 10 many times, we are now
trying to predict a single value for Y which is obtained if X is set equal to 10 only once.
Now, stop and think about this problem for a minute. Averages, by their very nature,
tend to be centered around the middle of a data set. They are therefore easier to predict
since we know about where they are. Individual values, however, are quite scattered
and are therefore much more difficult to predict. Hence, a 95 percent confidence
interval for a single value of Y must be wider than that for a conditional mean.

The predictive interval for Y also carries two interpretations. For the purpose of
illustration, these interpretations are provided under the assumption that the intervals
we calculate are 95 percent intervals, although other levels of confidence may of
course be used.
FIRST INTERPRETATION  If we were to set X equal to some amount just one time,
we would get one resulting value of Y. We can be 95 percent certain that that
single value of Y falls within the specified interval.


SECOND INTERPRETATION  If many samples were taken and each was used to
construct a predictive confidence interval, 95 percent of them would contain
the true value for Y.


In order to calculate this predictive interval, we must first calculate the standard
error of the forecast, S_{Y_i} (not to be confused with the standard error of the conditional
mean, S_Y). This standard error of the forecast accounts for the fact that individual
values are more dispersed than are means. The standard error of the forecast reflects
the sampling error inherent in the standard error of the conditional mean S_Y, plus the
additional dispersion that occurs because we are dealing with an individual value.
Formula (13.26) is used in its calculation.

$$S_{Y_i} = S_e \sqrt{1 + \frac{1}{n} + \frac{(X_i - \bar{X})^2}{SS_x}} \qquad [13.26]$$

The predictive interval for a single value of Y, Y_X, is then

$$\text{C.I. for } Y_X = \hat{Y} \pm t\,S_{Y_i} \qquad [13.27]$$


Let's now construct a 95 percent confidence interval for a single value of Y when
X = 10 and compare it with the interval for the conditional mean constructed earlier.


The Predictive Interval for Y Given an X-Value 


After receiving the interval estimate for the conditional mean from the marketing
division, the CEO now demands to know what the estimate is for passengers the next
time they spend X = $10,000 for advertising. The head of the marketing division
realizes that what the CEO is asking for is the predictive interval estimate for a single
value of Y.

SOLUTION: The division head therefore proceeds as follows:

$$S_{Y_i} = S_e \sqrt{1 + \frac{1}{n} + \frac{(X_i - \bar{X})^2}{SS_x}} = 0.907 \sqrt{1 + \frac{1}{15} + \frac{(10 - 12.47)^2}{137.73333}} = 0.907 \sqrt{1.1114} = 0.956$$

Since

$$\hat{Y} = 4.4 + 1.08(10) = 15.2$$

we obtain

$$\text{C.I. for } Y_X = \hat{Y} \pm t\,S_{Y_i} = 15.2 \pm (2.160)(0.956) = 15.2 \pm 2.065$$

$$13.14 < Y_X < 17.27$$


INTERPRETATION: We can be 95 percent certain that if in any single month X = $10,000,
the resulting single value of Y will be between 13,140 and 17,270 passengers.


As promised, this interval is wider than the first because we are working with less 
predictable individual values. The comparison is complete in Figure 13-20. 


Figure 13-20  Interval Estimates for μ_{Y|X} and Y_X
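
The comparison between the two intervals can also be made numerically. The following sketch is illustrative, with a hypothetical function name `interval`; it uses the rounded values quoted in the examples and differs between the two cases only in the extra 1 under the square root of Formula (13.26).

```python
import math

# Values quoted in the Hop Scotch examples
N, X_BAR, SS_X = 15, 12.47, 137.73333
S_E, T_CRIT = 0.907, 2.160  # standard error of the estimate; t for 95 percent, 13 d.f.
B0, B1 = 4.4, 1.08          # rounded regression coefficients


def interval(x0, single_value):
    """Return (lower, upper) for the conditional mean or for a single Y at X = x0."""
    core = 1 / N + (x0 - X_BAR) ** 2 / SS_X  # term shared by (13.24) and (13.26)
    if single_value:
        core += 1                             # extra dispersion of an individual value
    std_err = S_E * math.sqrt(core)
    y_hat = B0 + B1 * x0
    return y_hat - T_CRIT * std_err, y_hat + T_CRIT * std_err


mean_low, mean_high = interval(10, single_value=False)
pred_low, pred_high = interval(10, single_value=True)
print(f"conditional mean: {mean_low:.2f} to {mean_high:.2f}")
print(f"single value:     {pred_low:.2f} to {pred_high:.2f}")
```

Because the added 1 is present at every X-value, the predictive interval is wider than the interval for the conditional mean everywhere, not only at X = 10.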


13.10.2 Using the data from 13.10.1, calculate the 99 percent interval for a single value
of Y.


Answer: −0.557 to 15.16


Factors Influencing the Width of the Interval


Given a level of confidence, it is preferable to minimize the width of the interval. The
narrower the interval, the more accurate is our prediction of μ_{Y|X} or Y_X. However,
several forces are working against us in our effort to produce a narrower interval.

The first is the degree of dispersion of the original data. The more dispersed the
original data are, the greater will be S_e, the standard error of the estimate. Given the
arithmetic in Formulas (13.24) and (13.26), a higher S_e results in a wider interval.
Our sample size is a second factor in determining interval width. As we have seen
in previous chapters, a large sample size results in a smaller standard error. Again,
given the arithmetic described above, a small standard error results in a small interval.

Furthermore, as we have already seen, a value for X relatively close to X̄ will
produce a small interval since regression is based on averages. Therefore, a third
factor influencing interval width is how far the particular value of X that we are
interested in is from X̄.


HYPOTHESIS TEST ABOUT THE POPULATION CORRELATION COEFFICIENT


Since our correlation coefficient of r = 0.97 is not zero, we can conclude on the basis
of our sample data that there is a relationship between X and Y. However, remember
that this conclusion is based on only n = 15 observations. Only 15 months of data
were used in our study. As always, our interest is with the entire population of all
X-values and all Y-values. It is possible that due to sampling error, our sample may be
misleading. Although our sample data reveal a relationship, there may be no such
relationship at the population level.

We must therefore examine the possibility that despite the fact our sample
suggests a relationship between X and Y, it may be that no such relationship exists.
Could it be that if we plotted the scatter diagram for all X, Y data points it would
appear as in Figure 13-21?


Figure 13-21  A Possible Pattern for the Population of All Data Points for Hop Scotch
Airlines


This pattern of data points reveals that the correlation is zero, and no relationship
exists between X and Y. However, isn't it entirely possible that, just due to the luck of
the draw, our sample might just happen to include those 15 data points enclosed in the
ellipse? Indeed it is! Of all the data points in the population of X, Y-values, it is
entirely possible that we might just happen to select those 15 indicated in the circle.
Consider the consequences of selecting these 15 observations as the sample. A scatter
diagram with these sample data would falsely suggest a positive relationship between
X and Y.
The resulting problem should be obvious. Our sample has misled us. The sample
correlation coefficient r would be positive, but the population correlation coefficient ρ
(the Greek letter rho) would be zero. While there actually is no relationship between X
and Y, our sample incorrectly reports a positive correlation. We would mistakenly
conclude that a relationship did exist between X and Y.

Therefore, despite the fact that the sample correlation coefficient was not zero, it
is often desirable to test the hypothesis that the population correlation coefficient is 0.
Our hypotheses are

H₀: ρ = 0
H_A: ρ ≠ 0

Although our analysis has shown that the sample correlation coefficient is not zero,
the hypothesis test is done to determine if it is significantly different from zero. This
test employs the t-statistic

$$t = \frac{r}{S_r} \qquad [13.28]$$

and has n − 2 degrees of freedom, where S_r is the standard error of the sampling
distribution of r. It recognizes that if several samples of size n = 15 were taken, we
would get different values for r. That is, we can get many different samples from the
population, each with its own r-value. If ρ = 0, the r-values would be distributed
around ρ, ranging from −1 to +1, as shown in Figure 13-22. S_r is found by

$$S_r = \sqrt{\frac{1 - r^2}{n - 2}} \qquad [13.29]$$


Figure 13-22  The Distribution of the Sample Correlation Coefficient


We choose a level of confidence, such as 95 percent (α = 0.05), at which to test
the null hypothesis ρ = 0. This choice allows us to find a critical value of t from the
t-table. This critical value of t is compared with the t we calculated from Formula
(13.28) based on our sample data.

For example, if we were to test the null at the 95 percent level of confidence, we
find from the table that the critical t-values, given 15 − 2 = 13 degrees of freedom,
are ±2.160, as demonstrated in Figure 13-23. This means that if ρ does equal zero, 95
percent of the samples of size n = 15 that you could take would yield data that provide
a t-value between −2.160 and +2.160. There is only a 5 percent chance that if ρ = 0,
your sample would yield a t-value below −2.160 or above 2.160. If in using the
sample data to solve Formula (13.28), you get a t-value outside that range, you can be
95 percent certain that ρ ≠ 0, thus indicating that there is a relationship between X and
Y at the population level. On the other hand, if your t-value is between −2.160 and
+2.160, you cannot reject the null hypothesis ρ = 0. Despite your sample results, you
would not have sufficient evidence to conclude at the 95 percent level of confidence
that a relationship exists between X and Y.


Figure 13-23  Critical t-Values for Testing the Hypothesis that ρ = 0


Despite the r-value of 0.97 found by the marketing division, the head of the finance
section for Hop Scotch is skeptical. He wants to test the hypothesis that ρ = 0, even
though the sample collected by the marketing people strongly suggests a relationship.


SOLUTION: If the statisticians in the finance section were to test the hypothesis that ρ = 0
at the 95 percent level, they would have

H₀: ρ = 0
H_A: ρ ≠ 0

$$S_r = \sqrt{\frac{1 - r^2}{n - 2}} = \sqrt{\frac{1 - 0.93776}{15 - 2}} = 0.06919$$

Using Formula (13.28), we have

$$t = \frac{r}{S_r} = \frac{0.96838}{0.06919} = 13.995$$

From Table F we find that a 95 percent level of confidence carries a critical value of
±2.160. Then the decision rule becomes


DECISION RULE  Do not reject the null that ρ = 0 if the t-value is between
−2.160 and +2.160. Reject ρ = 0 if the t-value is less than −2.160 or
exceeds +2.160.


INTERPRETATION: Since t = 13.995 > 2.160, the head of finance must reject the null
that ρ = 0 and conclude with 95 percent certainty that there is a relationship between
X and Y. This test is said to be significant at the 5 percent level.
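
The test just completed can be sketched in Python as a check on the arithmetic. This is an illustration only: the function name `correlation_t_test` is hypothetical, and the critical value is supplied from the t-table rather than computed.

```python
import math


def correlation_t_test(r, n, t_crit):
    """Test H0: rho = 0 using Formulas (13.28) and (13.29).

    Returns (S_r, t, reject), where reject is True when |t| exceeds t_crit.
    """
    s_r = math.sqrt((1 - r ** 2) / (n - 2))  # standard error of r
    t = r / s_r                              # test statistic, n - 2 d.f.
    return s_r, t, abs(t) > t_crit


# Hop Scotch: r = 0.96838, n = 15, critical t = 2.160 at the 95 percent level
s_r, t, reject = correlation_t_test(r=0.96838, n=15, t_crit=2.160)
print(f"S_r = {s_r:.5f}, t = {t:.3f}, reject H0: {reject}")
```

The same function applies to any sample; only r, n, and the tabled critical value change.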


13.11.1 Using the data from Quick Check 13.8.1, test the hypothesis that ρ = 0 at the 95
percent level.
Answer: t = 10.35 > 2.776. Reject null that ρ = 0.


TESTING INFERENCES ABOUT THE POPULATION REGRESSION COEFFICIENT


Much of the work done to test inferences regarding the population correlation
coefficient can also be applied to inferences concerning the population regression
coefficient. The purpose and rationale are much the same. Our conclusions regarding
the relationship between X and Y are based on sample data. It is possible that the
implications drawn from these sample data are misleading due to sampling error. Our
sample produced a nonzero regression coefficient of b₁ = 1.08, thereby suggesting a
relationship between X and Y. However, perhaps our sample is in error and there
actually exists no relationship at the population level. It is necessary to test the
hypothesis that the population regression coefficient β₁ is actually zero even though
b₁, the sample regression coefficient, was not zero. If it is concluded that β₁ is not
zero, we can then surmise that our sample conveys the correct impression in suggesting
a relationship between the dependent and independent variables.


Hypothesis Test for β₁


If the slope of the actual but unknown population regression line is zero, there is no
relationship between X and Y. However, due to the luck of the draw in the sample, we
might select sample data that suggest a relationship. This might happen as shown in
Figure 13-24. While the population of data points shows no relationship between X
and Y, and β₁ = 0, we might just happen to have selected a sample such as that
represented by the n = 15 points circled in Figure 13-24. As you can plainly see, the
sample regression would be positively sloped, b₁ > 0, and a relationship would be
suggested by OLS. It is therefore often a wise practice to test the hypothesis that β₁ =
0 given b₁ ≠ 0. Again, as with the sample correlation coefficient, the intent is to
determine if the sample regression coefficient is significantly different from zero. The
test involves

H₀: β₁ = 0
H_A: β₁ ≠ 0


Figure 13-24  A Possible Pattern of Population Data for Hop Scotch Airlines
(horizontal axis: Advertising)


and uses a t-statistic defined as

$$t = \frac{b_1}{S_{b_1}} \qquad [13.30]$$

where S_{b_1} is the standard error of the regression coefficient b₁. This standard error
of the regression coefficient (not to be confused with the standard error of the estimate
S_e) recognizes that the regression coefficient b₁ will vary from one sample to the next.
If we were to select a second sample of n = 15 observations and calculate the OLS
model, we would likely not get a b₁-value of 1.08. Due to sampling error, the value for
b₁ will vary from sample to sample. S_{b_1} measures that variation in the regression
coefficient. It is calculated as

$$S_{b_1} = \frac{S_e}{\sqrt{SS_x}} \qquad [13.31]$$

where S_e is the standard error of the estimate, which we calculated at the beginning of
this chapter to be 0.90678.
After deciding on the level of confidence we wish to employ, a critical value of t
is obtained from the table and compared with the t-value calculated from the sample
by using Formula (13.30).


The head of finance is tough to convince. Despite the fact that the sample regression
coefficient was not zero, he now wants to perform a hypothesis test to determine if it is
significantly different from zero, allowing him to conclude with some certainty that
the population regression coefficient is not zero.


SOLUTION: In running this test for Hop Scotch Airlines we specify

H₀: β₁ = 0
H_A: β₁ ≠ 0

The test will be conducted at the 99 percent level of confidence in order to provide the
skeptical head of the finance section with the maximum degree of assurance.

$$S_{b_1} = \frac{S_e}{\sqrt{SS_x}} = \frac{0.90678}{\sqrt{137.73333}} = 0.07726$$

Then

$$t = \frac{b_1}{S_{b_1}} = \frac{1.0813165}{0.07726} = 13.995$$

If α = 0.01, the critical values from the t-table are ±3.012.


DECISION RULE  Do not reject H₀: β₁ = 0 if the t-value is between −3.012 and
+3.012. Reject if the t-value is outside that range.


INTERPRETATION: Since t = 13.995, we reject H₀: β₁ = 0 and conclude that there is a
relationship between X and Y. There is only a 1 percent chance that if β₁ = 0 our
sample would yield a t-value outside the specified range. The t-value of 13.995 allows
us to reject β₁ = 0 with a 99 percent level of confidence.
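
As a check on the arithmetic, the test can be sketched in Python. The function name `slope_t_test` is hypothetical, and the critical value of 3.012 is supplied from the t-table rather than computed.

```python
import math


def slope_t_test(b1, s_e, ss_x, t_crit):
    """Test H0: beta1 = 0 using Formulas (13.30) and (13.31).

    Returns (S_b1, t, reject), where reject is True when |t| exceeds t_crit.
    """
    s_b1 = s_e / math.sqrt(ss_x)  # standard error of the regression coefficient
    t = b1 / s_b1                 # test statistic, n - 2 degrees of freedom
    return s_b1, t, abs(t) > t_crit


# Hop Scotch values quoted in the example; 3.012 is the 99 percent critical t (13 d.f.)
s_b1, t, reject = slope_t_test(b1=1.0813165, s_e=0.90678, ss_x=137.73333, t_crit=3.012)
print(f"S_b1 = {s_b1:.5f}, t = {t:.3f}, reject H0: {reject}")
```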
The t-value of 13.995 we obtained in our test of the regression coefficient, H₀:
β₁ = 0, is identical to the t-value reported in our test of the correlation coefficient,
H₀: ρ = 0. This is no coincidence. It will always occur in simple regression. It can be
shown that r = b₁√(SS_x/SS_y).

Therefore, if b₁ = 0, r must be 0. Consequently, the null hypothesis H₀: β₁ = 0 is
equivalent to H₀: ρ = 0. In practice, it is really necessary to perform only one of the
tests.
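
The equality of the two t-values can be verified in a few lines. The sketch below assumes the definitions used in this chapter, S_e² = SSE/(n − 2), together with SSE = SS_y(1 − r²); the latter follows from r² = SSR/SS_y and is stated here as an assumption about the earlier formulas.

```latex
\begin{align*}
S_{b_1} &= \frac{S_e}{\sqrt{SS_x}}
         = \sqrt{\frac{SS_y\,(1 - r^2)}{(n - 2)\,SS_x}}
         = S_r \sqrt{\frac{SS_y}{SS_x}}, \\
b_1 &= r \sqrt{\frac{SS_y}{SS_x}}, \\
t &= \frac{b_1}{S_{b_1}}
   = \frac{r \sqrt{SS_y / SS_x}}{S_r \sqrt{SS_y / SS_x}}
   = \frac{r}{S_r}.
\end{align*}
```

The factor √(SS_y/SS_x) cancels, so the statistic for the slope and the statistic for the correlation coefficient are the same number in any simple regression.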

Display 13-2, which is identical to 13-1 and is repeated here for convenience,
reveals the results of our SPSS-PC run. It shows under column SE B that the standard
error of the regression coefficient, which we calculated using Formula (13.31), is
0.07726, and the t-value is 13.995. The significance of the regression coefficient is
given in column Sig T. The lower this Sig T value, the "more significant" is the
regression model. This Sig T value is analogous to the p-value we encountered in our
discussion of hypothesis testing. It is the lowest α-value at which we can reject the null
hypothesis that β₁ = 0. If the Sig T for ADV was, say, 0.06, then we would reject
β₁ = 0 if the hypothesis was tested at α = 0.10 (or anything above 6 percent), but we
would not reject it at α = 5 percent (or anything below 6 percent). The value shown in
the printout of .0000 means that ADV is significant at any α-level we might choose.


DISPLAY 13-2  Coefficients for Hop Scotch


Equation Number 1    Dependent Variable..   PASS
Variable(s) Entered on Step Number
   1..    ADV

Multiple R           .96838
R Square             .93776
Adjusted R Square    .93297
Standard Error       .90678


* * * *  MULTIPLE REGRESSION  * * * *
Equation Number 1    Dependent Variable..   PASS


------------------ Variables in the Equation ------------------
Variable            B        SE B       Beta         T    Sig T
ADV           1.08132      .07726     .96838    13.995    .0000
(Constant)    4.38625      .99128                4.425    .0007

End Block Number 1   All requested variables entered.


13.12.1 Using the data from 13.7.2, test the hypothesis at the 95 percent level that the
population regression coefficient = 0.
Answer: t = 3.87 > 2.571. Reject null.


A Confidence Interval for β₁


Since b₁ = 1.08 is only a point estimate of β₁, we may desire a confidence interval for
the population regression coefficient. This can be accomplished via

$$\text{C.I. for } \beta_1 = b_1 \pm t\,S_{b_1} \qquad [13.32]$$

where the t-statistic has n − 2 degrees of freedom.


EXAMPLE  A Confidence Interval for β₁


Despite the fact that all of the analysis so far has found that a relationship exists
between advertising and the number of passengers, the head of the finance section still
insists that further tests be completed. The hypothesis test showed with 99 percent
confidence that the sample regression coefficient was significantly different from zero
and allowed us to conclude that the population regression coefficient was not zero. If
β₁ is not zero, the head of finance wonders what it is. He therefore orders that a
confidence interval for β₁ be constructed.


SOLUTION: If we choose a 99 percent level of confidence for our test, we find

$$\text{C.I. for } \beta_1 = b_1 \pm t\,S_{b_1} = 1.08132 \pm (3.012)(0.07726) = 1.08132 \pm 0.2327$$

$$0.849 < \beta_1 < 1.314$$


INTERPRETATION: We can be 99 percent certain that β₁ lies between 0.849 and 1.314,
thereby indicating a positive relationship between advertising and the number of
passengers. Finally, the head of finance is willing to accept the fact that there is indeed
a relationship between advertising and the number of passengers who choose Hop
Scotch.
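
The interval can be reproduced with a short Python sketch. The function name `slope_confidence_interval` is hypothetical, and the critical value is again supplied from the t-table.

```python
def slope_confidence_interval(b1, s_b1, t_crit):
    """Return (lower, upper) limits for beta1 using Formula (13.32)."""
    margin = t_crit * s_b1
    return b1 - margin, b1 + margin


# Hop Scotch: b1 = 1.08132, S_b1 = 0.07726, 99 percent critical t = 3.012 (13 d.f.)
lower, upper = slope_confidence_interval(b1=1.08132, s_b1=0.07726, t_crit=3.012)
print(f"{lower:.3f} < beta1 < {upper:.3f}")
```

Since the interval lies entirely above zero, it is consistent with the earlier rejection of H₀: β₁ = 0 at the same level of confidence.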


13.12.2 Using the data from 13.7.2, develop a 95 percent confidence interval for β₁.


Answer: 0.29 to 1.38


ANALYSIS OF VARIANCE REVISITED


The regression model presents a description of the nature of the relationship between
our dependent and independent variables. We used a t-test to test the hypothesis that
β₁ = 0. A similar test can be conducted with the use of analysis of variance (ANOVA)
based on the F-test. The ANOVA procedure measures the amount of variation in our
model. As noted earlier, there are three sources of variation in a regression model:
variation explained by our regression (SSR), variation that remains unexplained due to
error (SSE), and the total variation (SST), which is the sum of the first two. These can
be summarized in an ANOVA table, the general form of which is shown in Table
13-4.


Table 13-4  A General ANOVA Table

Source of       Sum of     Degrees of     Mean
Variation       Squares    Freedom        Square                     F-ratio
Regression      SSR        k              MSR = SSR/k                F = MSR/MSE
Error           SSE        n − k − 1      MSE = SSE/(n − k − 1)
Total           SST        n − 1

The ratio MSR/MSE provides a measure of the accuracy of our model because it is
the ratio between the mean squared deviation explained by our model and the mean
squared deviation left unexplained. The higher this ratio, the more explanatory power
our model has. That is, a high F-value signals that our model possesses significant
explanatory power. To determine what is high, our F-value must be compared with a
critical value taken from Table G.


The computational formula for SSE was given by Formula (13.14). SSR can be
calculated as

$$SSR = \frac{(SS_{xy})^2}{SS_x} \qquad [13.33]$$

Using our data for Hop Scotch, we have

$$SSE = SS_y - \frac{(SS_{xy})^2}{SS_x} = 171.73333 - \frac{(148.93333)^2}{137.73333} = 10.69$$

and

$$SSR = \frac{(148.93333)^2}{137.73333} = 161.0441$$

SST is found as the sum of SSR and SSE, as shown in Table 13-5. The F-value carries
1 and 13 degrees of freedom since it was formed with the mean square regression and
the mean square error as seen in Table 13-5.


Table 13-5  The ANOVA for Hop Scotch Airlines

Source of       Sum of     Degrees of     Mean
Variation       Squares    Freedom        Square     F-ratio
Regression      161.04      1             161.04     196.39
Error            10.69     13               0.82
Total           171.73     14


We can set α = 0.05 to test the hypothesis that β₁ = 0. Then F(0.05; 1, 13) = 4.67
produces a decision rule stating that we should reject the null if our F-value exceeds
4.67. Since 196.39 > 4.67, we reject the null and conclude with 95 percent confidence
that advertising has explanatory power. This is the same result obtained in our t-test
using Formula (13.30).


Actually, in simple regression, the F-test and the t-test are analogous. Both will
give the same results. The F-value is the square of the t-value. In multiple regression
the F-test produces a more general test to determine if any of the independent
variables in the model carry explanatory power. Each variable is then tested
individually with the t-test to determine if it is one of the significant variables.

Display 13-3 provides the ANOVA printout from our SPSS-PC computer run for 
Hop Scotch. 


DISPLAY 13-3  ANOVA for Hop Scotch


Analysis of Variance

              Degrees of Freedom    Sum of Squares    Mean Square
Regression     1                    161.04408         161.04408
Residual      13                     10.68925            .82225

F = 195.85772        Signif F = .0000


COMPUTER APPLICATIONS


The advent of the modern computer has made regression and correlation much easier and more
applicable to the problems commonly encountered in business situations. Without the
computer, the calculations required by regression and correlation analysis would likely prevent
their use in many cases. This section examines how the various computer packages can be used
to execute regression and correlation analysis.


SAS-PC 


Using the data for the relationship between sales and advertising for Hop Scotch Airlines as the 
example, the SAS program calling for the regression results is 


SAS Input


DATA; 
INPUT ADV PASS; 
CARDS; 
10 15 
12 17 
8 13 
17 23 
rest of data Lines go here 
10 15 
12 16 
PROC REG; 
MODEL PASS = ADV; 


As is usually the case, the variables are specified in the INPUT statement. The PROC 
REG; statement calls for the results of the regression package. The MODEL PASS = ADV; 
statement specifies the variables in the regression model. The dependent variable must be 
specified first, followed by an equal sign and the independent variable. The statement PROC 
GLM (for general linear model) can be used in place of PROC REG. The SAS output for this 
program is in the following display.

SAS Output

Dependent Variable: PASS

Source             DF    Sum of Squares    Mean Square    F Value    PR > F
Model               1           161.044        161.044     195.86    0.0000
Error              13            10.689           .822
Corrected Total    14

R-Square
0.93776

                             T for H0:                     Std Error
Parameter     Estimate    Parameter = 0     PR > |T|     of Estimate
Intercept        4.386            4.425        .0007          .99123
Adv              1.081           13.995        .0000          .07726


The coefficient and intercept values are shown, along with the coefficient of
determination. In addition, the standard error and t-statistic are reported. The probability
value reported under PR > |T| is 0.0000 for the
independent variable, which tells us that the probability that the t-value of 13.995 has occurred
by chance is virtually zero. Thus, the model is significant at the 1 percent level. The section on
analysis of variance, as well as some of the other reported statistics, are discussed in other
chapters.
To produce a correlation matrix, enter 


PROC CORR; 
VAR PASS ADV; 


and, to produce a scatter diagram, use 


PROC PLOT;
PLOT PASS * ADV;


To determine predictive intervals, use 


PROC REG; 
MODEL PASS = ADV/P CLI; 


The “P” following the slash asks for the predicted values of Y and the residuals; the “CLI”
generates the predictive intervals for Y. If the line ID X; is added, SAS-PC will include the
value for X for each of the predicted Y-values.


SPSS-PC

Using Hop Scotch Airlines data for sales and advertising, the SPSS program for the regression
results is


SPSS-PC Input


DATA LIST FREE / ADV PASS.
BEGIN DATA.
10 15
12 17
8 13
rest of data go here
10 15
12 16
END DATA.
REGRESSION VARIABLES = ADV PASS/ DEPENDENT = PASS/METHOD ENTER.


Display 13-1 (shown earlier in the chapter) contains the printout.
If you desire an examination of the residuals, the command on the last line becomes


REGRESSION VARIABLES ADV PASS/ DEPENDENT PASS/METHOD ENTER/CASEWISE ALL.


The inclusion of CASEWISE ALL. calls for the residuals for all observations.
The command


REGRESSION VARIABLES ADV PASS/STAT CI/DEPENDENT PASS/METHOD ENTER.

provides a 95 percent confidence interval for $\beta_0$ and $\beta_1$.

CORRELATION VARIABLES ADV PASS.

produces a correlation matrix. And 

PLOT PLOT = PASS BY ADV. 


produces a scatter diagram. 


Minitab


The Minitab program for a regression model using our Hop Scotch data is


Minitab Input

MTB> READ C1 AND C2
DATA> 10 15
DATA> 12 17
DATA> 8 13
DATA> 17 23
rest of data
DATA> 12 16
DATA> END
MTB> REGRESS C2 ON 1 PREDICTOR C1

The resulting Minitab output is


The regression equation is
PASS = 4.39 + 1.08 ADV

Predictor      Coef       Stdev     t-ratio     p
Constant       4.3863     0.9913     4.42       0.001
ADV            1.08132    0.07726   13.99       0.000

s = 0.9068     R-sq = 93.8%     R-sq(adj) = 93.3%

Analysis of Variance

SOURCE         DF     SS        MS        F         p
Regression      1     161.04    161.04    195.86    0.000
Error          13      10.69      0.82
Total          14     171.73

Unusual Observations

Obs.   ADV     PASS      Fit       Stdev.Fit   Residual   St.Resid
10     10.0    17.000    15.199    0.302       1.801      2.11R

R denotes an obs. with a large st. resid.
MTB>


Since the results are so similar to those obtained from the other packages, they should be
self-explanatory.
You can use the command 


MTB> REGRESS C2 ON 1 PREDICTOR C1; 
SUBC> BRIEF K. 


where K is either 1, 2, or 3. The higher the number, the more complete the printout. If you use
no BRIEF subcommand, the default is 2. If you use 3, the residuals are reported. A subcommand
of PREDICT produces the 95 percent confidence interval for the value of Y given any
X-value, and the 95 percent predictive interval given any X-value. Thus,


REGRESS C2 ON 1 PREDICTOR C1; 
PREDICT 10. 


reports these intervals, given X is equal to 10. 


Computerized Business Statistics


To execute regression and correlation using CBS, select Number 9, Simple Correlation and
Regression, from the main menu. Using our example from Hop Scotch Airlines, you would
proceed as follows:


CBS PROMPT                 YOUR RESPONSE
Number of Data Points      15
Alpha Error                0.05
Variable Labels
   Independent             Adv
   Dependent               Pass
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Then enter the data, and select Number 7, Run Problem. The printout appears as shown here. 


Results


B0 Coefficient:                   4.3863
B1 Coefficient:                   1.0813

Mean of X (adv):                 12.4667
Mean of Y (pass):                17.8667

Sum of Squares Regression:      161.0441
Sum of Squares Error:            10.6893
Sum of Squares Total:           171.7333

Coefficient of Determination:     0.9378
Correlation Coefficient:          0.9684
Standard Error Estimate:          0.9068
Standard Error B1:                0.0773

Computed t:                      13.9949
Critical t:                       2.1600
p value:                          0.0002

Conclusion: B1 is statistically significant


The first two lines report the values for the intercept and the regression coefficient. The next set
of values constitute the ANOVA table. Other relevant statistics are reported in the remaining
lines.


You are given the options of first viewing the residuals, and then conducting forecast and 
interval analysis. With this option, you can provide a value for X at the prompt, and CBS will 
report the interval for the conditional mean and the predictive interval. 


SOLVED PROBLEMS 


1. Lee Iacocca’s Financial Inquiry Lee Iacocca, chairman and CEO of Chrysler Corporation,
expressed a concern regarding the company's high cost structure following the
acquisition of AMC, and what he called “skimpy profits in the face of rising sales.” In
1988 he ordered company executives to undertake a concerted study of Chrysler's cost
structure as it related to reported sales. The data on K-car production shown here were
collected. Company analysts used them to construct a regression model depicting the
manner in which costs depended on production and subsequent sales volume. Costs were
therefore taken as the dependent variable. Figures for costs are in units of $100,000, and
values for sales are in millions of dollars. The data are monthly values for the company as a
whole.

Month    Costs (Y)    Sales (X)
 1       15.8         23
 2       12.3         18
 3       14.5         21
 4       15.7         23
 5       12.7         18
 6       13.5         19
 7       13.7         20
 8       15.9         22
 9       13.7         19
10       14.3         21


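The regression model appears as follows. From the data,

$\sum Y = 142.1 \quad \sum X = 204 \quad \sum XY = 2,919.9 \quad \sum X^2 = 4,194.0 \quad \sum Y^2 = 2,033.89$

$SS_x = \sum X^2 - \frac{(\sum X)^2}{n} = 4,194 - \frac{(204)^2}{10} = 32.4$

$SS_y = \sum Y^2 - \frac{(\sum Y)^2}{n} = 2,033.89 - \frac{(142.1)^2}{10} = 14.649$

$SS_{xy} = \sum XY - \frac{(\sum X)(\sum Y)}{n} = 2,919.9 - \frac{(204)(142.1)}{10} = 21.06$

Then

$b_1 = \frac{SS_{xy}}{SS_x} = \frac{21.06}{32.4} = 0.65$

$b_0 = \bar{Y} - b_1 \bar{X} = 14.21 - (0.65)(20.4) = 0.95$

$\hat{Y} = 0.95 + 0.65X$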


The regression model shows that as sales increase by one unit, or $1,000,000, costs will go 
up by 0.65 hundred thousand dollars, or $65,000. The constant of 0.95 indicates that if the 
firm shuts down and sales are zero, costs will equal $95,000. Those of you with a 
background in economics or finance may recognize the 0.95 hundred thousand dollars as 
the amount of fixed costs. 
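
As a cross-check, the slope and intercept can be recomputed directly from the monthly figures. The Python sketch below is illustrative only: the list and function names are not from the text, and the month 8 sales value of 22 is an assumption inferred from the reported total $\sum X = 204$.

```python
# Monthly data from the problem: costs in units of $100,000, sales in $ millions.
# Assumption: month 8 sales is taken as 22, the value implied by the total of 204.
costs = [15.8, 12.3, 14.5, 15.7, 12.7, 13.5, 13.7, 15.9, 13.7, 14.3]
sales = [23, 18, 21, 23, 18, 19, 20, 22, 19, 21]
n = len(sales)

sum_x, sum_y = sum(sales), sum(costs)
sum_xy = sum(x * y for x, y in zip(sales, costs))
sum_x2 = sum(x * x for x in sales)

# Computational forms of the sums of squares and cross-products.
ss_x = sum_x2 - sum_x ** 2 / n
ss_xy = sum_xy - sum_x * sum_y / n

b1 = ss_xy / ss_x                    # slope
b0 = sum_y / n - b1 * sum_x / n      # intercept


def predicted_cost_dollars(sales_millions):
    """Predicted monthly cost in dollars for a sales level given in $ millions."""
    return (b0 + b1 * sales_millions) * 100_000


print(f"b1 = {b1:.2f}, b0 = {b0:.2f}")
print(f"Cost at zero sales: ${predicted_cost_dollars(0):,.0f}")
```

Converting the fitted value back to dollars makes the unit interpretation explicit: each extra $1 million in sales adds 0.65 of a $100,000 unit to predicted costs.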

2. Iacocca Examines the Closeness of the Relationship Iacocca is also interested in
determining the strength of the relationship between sales and costs.

$r^2 = \frac{(SS_{xy})^2}{(SS_x)(SS_y)} = \frac{(21.06)^2}{(32.4)(14.649)} = 0.93$


The coefficient of determination suggests that there is a strong positive correlation 
between costs and sales. In fact, 93 percent of the change in costs is explained by a change 
in sales. 
3. A Keynesian Consumption Function In his famous 1936 book, The General Theory of
Employment, Interest and Money, the noted British economist John Maynard Keynes
proposed a theoretical relationship between income and personal consumption expenditures.
Keynes argued that as income went up, consumption would rise by a smaller
amount. This theoretical relationship has been empirically tested many times since 1936.
Milton Friedman, former professor of economics at the University of Chicago, and
winner of the Nobel prize in economics, collected extensive data on income and consumption
in the United States over a long period of time. Shown here are 10 observations on
annual levels of consumption and income used by Friedman in his study. Using these data,
derive a consumption function under the assumption that there exists a linear relationship
between consumption and income. Figures are in billions of current dollars.


Year    Income    Consumption
1950    284.8     191.0
1951    328.4     206.3
1952    345.5     216.7
1953    364.6     230.0
1954    364.8     236.5
1955    398.0     254.4
1956    419.2     266.7
1957    441.1     281.4
1958    447.3     290.1
1959    483.7     311.2


a. Since consumption depends on income, consumption is the Y, or dependent, variable.
Friedman sought a consumption function in the form

$\hat{C} = b_0 + b_1 I$

where C is consumption and I is income.


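$\sum X = 3,877.4 \quad \sum Y = 2,484.3 \quad \sum XY = 984,615.32 \quad \sum X^2 = 1,537,084.88 \quad \sum Y^2 = 630,869.49$

$SS_x = \sum X^2 - \frac{(\sum X)^2}{n} = 1,537,084.88 - \frac{(3,877.4)^2}{10} = 33,661.804$

$SS_y = \sum Y^2 - \frac{(\sum Y)^2}{n} = 630,869.49 - \frac{(2,484.3)^2}{10} = 13,694.841$

$SS_{xy} = \sum XY - \frac{(\sum X)(\sum Y)}{n} = 984,615.32 - \frac{(3,877.4)(2,484.3)}{10} = 21,352.838$

$b_1 = \frac{SS_{xy}}{SS_x} = \frac{21,352.838}{33,661.804} = 0.634$

$b_0 = \bar{Y} - b_1 \bar{X} = 248.43 - (0.634)(387.74) = 2.603$


Therefore,

$\hat{C} = 2.603 + 0.63I$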


These are not the same values Friedman found because we used only a very small
portion of his data set. However, our model bears out Keynes’s theory. The coefficient
of 0.63 shows that for every $1 (or $1,000,000,000) increase in income, consumption
will increase by 63 cents (or $630,000,000). Those of you who have taken an
introductory macroeconomics course will recognize 0.63 as the marginal propensity to
consume. The constant, or intercept term, of 2.603 is the level of consumption when
income is zero. Economists often argue that this economic interpretation of the
intercept term is invalid since an economic system will always generate positive
income. The consumption function is therefore often graphed without the intercept, as
in the figure. If I = 345.5, as in 1952, our model predicts

$\hat{C} = 2.603 + 0.63(345.5) = 220.26$

Consumption was actually 216.7 in 1952, resulting in an error of $3.56 billion.
[Figure: the estimated consumption function $\hat{C} = 2.60 + 0.63I$, with income on the horizontal axis and consumption on the vertical axis, marked at I = 345.5.]

b. The coefficient of determination is

$r^2 = \frac{(SS_{xy})^2}{(SS_x)(SS_y)} = \frac{(21,352.838)^2}{(33,661.804)(13,694.841)} = 0.989$


A change in income explains over 98 percent of the change in consumption. Information
concerning the values of $b_0$, $b_1$, and $r^2$ is vital to those who advise Congress and
the president on matters of national economic policy.
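
The consumption function can also be fitted from deviations about the means, which is algebraically the same as the computational formulas used above. The Python sketch below is a hedged illustration: `fit_line` is a hypothetical helper, and the 1950 income (284.8) and 1959 consumption (311.2) are assumptions inferred from the reported totals. At full precision the intercept comes out near 2.47 rather than 2.603, because the hand calculation rounds $b_1$ to 0.634 before computing $b_0$.

```python
def fit_line(x, y):
    """Return (b0, b1, r_squared) for a simple OLS fit of y on x.

    Uses sums of squared deviations from the means.
    """
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    ss_x = sum((xi - x_bar) ** 2 for xi in x)
    ss_y = sum((yi - y_bar) ** 2 for yi in y)
    ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = ss_xy / ss_x
    b0 = y_bar - b1 * x_bar
    return b0, b1, ss_xy ** 2 / (ss_x * ss_y)


# Billions of current dollars, 1950 through 1959.
# Assumption: 284.8 and 311.2 are inferred from the reported sums.
income = [284.8, 328.4, 345.5, 364.6, 364.8, 398.0, 419.2, 441.1, 447.3, 483.7]
consumption = [191.0, 206.3, 216.7, 230.0, 236.5, 254.4, 266.7, 281.4, 290.1, 311.2]

b0, b1, r_squared = fit_line(income, consumption)
predicted_1952 = b0 + b1 * 345.5

print(f"C-hat = {b0:.3f} + {b1:.3f} I,  r^2 = {r_squared:.3f}")
print(f"Predicted 1952 consumption: {predicted_1952:.2f}")
```

The slope, the marginal propensity to consume, is insensitive to this rounding; only the intercept and the resulting point predictions shift slightly.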

4. Federal Reserve Actions to Stem Inflation After approximately six years of continued
expansion, the U.S. economy began to show signs of inflationary pressures in the fall of
1988. An article in a September issue of The Wall Street Journal described efforts by the
Federal Reserve Board to cool these inflationary fires. This was to be done by tightening
the money supply through a rise in the discount rate commercial banks must pay to borrow
from the Fed. In February 1988, Manuel H. Johnson, vice-chairman of the Fed, told an
audience at a Cato Institute conference that Fed actions regarding the discount rate could
be predicted on the basis of the federal funds rate, which is the fee banks charge each other
for overnight loans. However, throughout the rest of 1988, Fed watchers argued that the
federal funds rate was not serving as an adequate predictor of the changes in the discount
rate, and that this poor performance as a predictor made it difficult for investors trying to
predict what interest rate level the Fed would allow.

Shown here are values for the federal funds rate and the discount rate from mid-1987
to mid-1988. Do these data support the charges of the Fed watchers?


Federal Funds Rate (%)    Discount Rate (%)
 8.0                       7.5
 7.5                       7.5
 7.0                       7.0
 6.5                       6.5
 6.0                       6.0
 6.0                       5.5
 7.0                       5.5
 6.0                       5.5
 7.0                       5.5
 7.5                       5.5
 7.0                       6.0
----                      ----
83.0                      74.5

Since Johnson argued that the federal funds rate could explain the behavior of the discount
rate, the federal funds rate is seen as the independent variable.
a. The nature of the relationship between the federal funds rate and the discount rate can
be examined through regression and correlation analysis.


$\sum X = 83 \quad \sum Y = 74.5 \quad \sum XY = 518.5 \quad \sum X^2 = 579 \quad \sum Y^2 = 469.25$

$\bar{Y} = 6.21 \quad n = 12$

$SS_x = 4.9166667$
$SS_y = 6.72917$
$SS_{xy} = 3.20833$
$b_1 = 0.6525$
$b_0 = 1.6949$


Therefore,

$\hat{Y} = 1.69 + 0.653X$


The coefficient of determination is

$r^2 = \frac{(3.20833)^2}{(4.92)(6.73)} = 0.3111$

$r = 0.56$
The Fed watchers are correct in their criticism of the federal funds rate as a
predictor of changes in the discount rate. Only 31 percent of changes in the discount
rate are explained by changes in the federal funds rate.
b. A measure of goodness-of-fit which reflects the ability of the federal funds rate to
predict the discount rate is the standard error of the estimate.
The standard error of the estimate is found as follows:

$SSE = SS_y - \frac{(SS_{xy})^2}{SS_x} = 6.7292 - \frac{(3.208)^2}{4.9167} = 4.63033$

$MSE = \frac{SSE}{n - 2} = \frac{4.63033}{10} = 0.463033$

$S_e = \sqrt{MSE} = \sqrt{0.463033} = 0.6808$


Typically, the estimate of the discount rate is in error by 0.68 of a percentage
point.
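
The fit statistics for this problem can be reproduced from the summary sums alone. The Python sketch below assumes only the five sums and n = 12 reported in the problem; the variable names are illustrative. Carried at full precision, SSE comes out near 4.636, slightly different from the hand-calculated figure, while the standard error of the estimate still rounds to 0.68.

```python
import math

# Summary sums for the 12 monthly observations, as reported in the problem.
n = 12
sum_x, sum_y = 83.0, 74.5                      # federal funds rate (X), discount rate (Y)
sum_xy, sum_x2, sum_y2 = 518.5, 579.0, 469.25

ss_x = sum_x2 - sum_x ** 2 / n
ss_y = sum_y2 - sum_y ** 2 / n
ss_xy = sum_xy - sum_x * sum_y / n

b1 = ss_xy / ss_x                              # slope
b0 = sum_y / n - b1 * sum_x / n                # intercept
r_squared = ss_xy ** 2 / (ss_x * ss_y)         # coefficient of determination

sse = ss_y - ss_xy ** 2 / ss_x                 # error sum of squares
mse = sse / (n - 2)                            # mean square error
s_e = math.sqrt(mse)                           # standard error of the estimate

print(f"Y-hat = {b0:.4f} + {b1:.4f} X")
print(f"r^2 = {r_squared:.4f},  SSE = {sse:.4f},  Se = {s_e:.4f}")
```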

c. A test of the significance of the correlation coefficient would prove useful at this
point. Set the level of confidence at 95 percent. With 10 degrees of freedom, the
critical value for t is therefore ±2.228.

The hypotheses are

$H_0: \rho = 0$
$H_A: \rho \neq 0$


Decision Rule  Reject $H_0$ if t < -2.228 or if t > 2.228. Do not reject $H_0$ if
-2.228 < t < 2.228.

$t = \frac{r}{S_r} = \frac{r}{\sqrt{(1 - r^2)/(n - 2)}} = \frac{0.56}{\sqrt{(1 - 0.31)/10}} = \frac{0.56}{0.2627} = 2.13$


The null hypothesis cannot be rejected. Despite the sample finding of a positive
relationship between federal funds rates and the discount rate, the hypothesis that
there is no correlation cannot be rejected. The sample correlation coefficient is not
significant at the 5 percent level.

A test of the significance of the sample regression coefficient of $b_1 = 0.6525424$ is
also wise. The test will be conducted at the 99 percent level. With 10 degrees of
freedom the critical t-value is ±3.169.


$H_0: \beta_1 = 0$
$H_A: \beta_1 \neq 0$


Decision Rule  Reject $H_0$ if t < -3.169 or t > 3.169. Do not reject $H_0$ if
-3.169 < t < 3.169. The test requires

$t = \frac{b_1}{S_{b_1}}$

where

$S_{b_1} = \frac{S_e}{\sqrt{SS_x}} = \frac{0.681}{\sqrt{4.92}} = 0.307$

$t = \frac{0.652542}{0.307} = 2.126$
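
Both test statistics in this part can be recomputed from the sums of squares found in part a. The Python sketch below is illustrative: the critical values 2.228 and 3.169 are taken from the text, and the variable names are assumptions. It also shows that, in simple regression, the t-statistic for the correlation coefficient and the t-statistic for the slope are algebraically the same number; the small gap between 2.13 and 2.126 in the hand calculations is due to rounding.

```python
import math

n = 12
ss_x, ss_y, ss_xy = 4.9166667, 6.72917, 3.20833   # sums of squares from part a

b1 = ss_xy / ss_x
r = ss_xy / math.sqrt(ss_x * ss_y)
s_e = math.sqrt((ss_y - ss_xy ** 2 / ss_x) / (n - 2))

# Test of H0: rho = 0
s_r = math.sqrt((1 - r ** 2) / (n - 2))
t_r = r / s_r

# Test of H0: beta_1 = 0
s_b1 = s_e / math.sqrt(ss_x)
t_b1 = b1 / s_b1

reject_rho = abs(t_r) > 2.228     # 5 percent level, 10 degrees of freedom
reject_beta = abs(t_b1) > 3.169   # 1 percent level, 10 degrees of freedom

print(f"t for r  = {t_r:.3f}, reject H0: {reject_rho}")
print(f"t for b1 = {t_b1:.3f}, reject H0: {reject_beta}")
```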


The hypothesis that $\beta_1 = 0$ cannot be rejected. The value for $b_1$ is not significantly
different from zero at the 1 percent level. There is little or no confidence in the
federal funds rate as a predictor of the discount rate. Investors would be unwise to rely
on the federal funds rate as an indicator of what the discount rate and other interest
rates will do.

5. A Further Examination of the Discount Rate Based on the results of Problem 4,
professional bankers and investors can find little comfort in the ability of the federal funds
rate to predict the discount rate. Using the regression model to develop a point estimate of
the discount rate does not appear wise. To further examine the relationship between the
two variables, if any exists, we can calculate interval estimates of the discount rate.
a. People employed in banking and finance would be interested in an interval estimate
for the mean value of the discount rate if the federal funds rate was held constant for
several months. This is, of course, an interval estimate of the conditional mean of the
discount rate:

$\text{C.I. for } \mu_{y|x} = \hat{Y} \pm t S_Y$

and requires calculation of the standard error of the conditional mean, $S_Y$, and $\hat{Y}$, the
point estimator of the discount rate. Since the federal funds rate seemed to remain
around 7 percent quite often, it is at this rate that the confidence interval will be
calculated.
To calculate $S_Y$ and $\hat{Y}$, we have

$S_Y = S_e \sqrt{\frac{1}{n} + \frac{(X_i - \bar{X})^2}{SS_x}} = 0.6808 \sqrt{\frac{1}{12} + \frac{(7 - 6.9167)^2}{4.9167}} = 0.1982$

Also,

$\hat{Y} = b_0 + b_1 X = 1.6949 + 0.6525424(7) = 6.2627$

If the interval is calculated at a 95 percent level of confidence, the critical t-value is
$t_{0.05, n-2} = \pm 2.228$. We then have

$\text{C.I. for } \mu_{y|x} = \hat{Y} \pm t S_Y = 6.2627 \pm (2.228)(0.1982)$

$5.82 < \mu_{y|x} < 6.70$

Bankers can be 95 percent confident that if the federal funds rate is 7 percent for
several months, the mean discount rate they must pay to borrow money from the Fed
will fall between 5.82 percent and 6.70 percent. Their plans and policies can be
formulated according to this expectation.

b. If a banker wished to make plans for next month, he or she would be interested in what
the discount rate might be in that month given that the federal funds rate was 7
percent. The banker would therefore calculate a predictive interval for next month as
follows:

$\text{C.I. for } Y_x = \hat{Y} \pm t S_{y_i}$

This requires calculation of the standard error of the forecast, $S_{y_i}$. Assuming a 5
percent level of significance and a federal funds rate of 7 percent, the banker would
proceed as follows:

$S_{y_i} = S_e \sqrt{1 + \frac{1}{n} + \frac{(X_i - \bar{X})^2}{SS_x}} = 0.70927$

Since $\hat{Y} = 6.2627$, we have

$\text{C.I. for } Y_x = 6.2627 \pm (2.228)(0.70927)$

$4.68 < Y_x < 7.85$


The banker could formulate plans for next month’s operations on the realization 
that he or she could be 95 percent confident that if the federal funds rate was 7 percent, 
the discount rate would fall between 4.68 percent and 7.85 percent. This is a wider 
range than that found for the conditional mean of the discount rate. 
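
The two intervals can be computed side by side to make the comparison concrete. The Python sketch below is illustrative: it starts from the summary sums of Problem 4, takes the critical t-value of 2.228 from the text, and uses a hypothetical helper named `interval`. Small differences in the last digit relative to the hand calculations are due to rounding.

```python
import math

# Summary sums from Problem 4 (n = 12 monthly observations).
n = 12
sum_x, sum_y = 83.0, 74.5
sum_xy, sum_x2, sum_y2 = 518.5, 579.0, 469.25

x_bar = sum_x / n
ss_x = sum_x2 - sum_x ** 2 / n
ss_y = sum_y2 - sum_y ** 2 / n
ss_xy = sum_xy - sum_x * sum_y / n

b1 = ss_xy / ss_x
b0 = sum_y / n - b1 * x_bar
s_e = math.sqrt((ss_y - ss_xy ** 2 / ss_x) / (n - 2))

t_crit = 2.228  # two-tailed, 5 percent level, 10 degrees of freedom (from the text)


def interval(x_value, single_observation):
    """Interval for Y at x_value.

    single_observation=False: confidence interval for the conditional mean.
    single_observation=True:  predictive interval for one value of Y.
    """
    y_hat = b0 + b1 * x_value
    spread = 1 / n + (x_value - x_bar) ** 2 / ss_x
    if single_observation:
        spread += 1  # extra variation of a single value around its mean
    margin = t_crit * s_e * math.sqrt(spread)
    return y_hat - margin, y_hat + margin


mean_low, mean_high = interval(7, single_observation=False)
pred_low, pred_high = interval(7, single_observation=True)

print(f"Conditional mean: {mean_low:.2f} to {mean_high:.2f}")
print(f"Single forecast:  {pred_low:.2f} to {pred_high:.2f}")
```

The only difference between the two calculations is the added 1 under the square root, which is what makes the predictive interval wider.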

It would certainly appear that Johnson's statement concerning the use of the
federal funds rate to estimate or predict the discount rate is questionable. The $r^2$ is
rather low, and the tests for significance of $\rho$ and $\beta_1$ suggest that the hypotheses $\rho = 0$
and $\beta_1 = 0$ cannot be rejected at any acceptable levels of significance.

In all fairness, it might be argued that the federal funds rate should be lagged one
month. That is, the discount rate in any month (time period t) is a function of the
federal funds rate for the previous month (time period t - 1). This would allow the
Fed time to adjust the discount rate to last month’s federal funds rate, since the Fed
cannot respond immediately to changes in the federal funds rate. This is expressed as

$DR_t = f(FF_{t-1})$

where DR is the discount rate and FF is the federal funds rate. This lagged model
yields

$\hat{Y} = 0.6 + 0.8X$

with $r^2$ = 60 percent and $S_e$ = 0.47. This represents a major improvement over the
naive model, which does not include the lagged variable.


6. The Effect of Productivity on Real GNP A recent issue of Fortune magazine reported on
the relationship between worker productivity and rates of change in the nation’s level of
output measured in real terms. The message was that the increase in productivity during
the 1980s could serve as an explanatory factor for GNP growth. With both productivity
growth and changes in GNP measured in percentages, and GNP as the dependent variable,
annual data for that time period can be summarized as follows:

$\sum X = 32.5 \quad \sum Y = 62.2 \quad \sum XY = 255.4 \quad \sum X^2 = 135.25 \quad \sum Y^2 = 483.72 \quad n = 9$

The model is

$\hat{Y} = 0.69596273 + 1.721118X$

indicating that if productivity increases by one percentage point, real GNP will increase by
1.72 percent. The $r^2$ is 0.98407, and $S_e$ = 0.35.

For the purpose of formulating national tax policy, which some supply-side economists
argue has a direct impact on worker productivity, Washington planners tested the
significance of both the sample correlation coefficient and the sample regression coefficient.
Each proved significant at the 10 percent level.

The same planners then requested a confidence interval for the population regression
coefficient at the 10 percent level:

$\text{C.I. for } \beta_1 = b_1 \pm t S_{b_1}$

$S_{b_1} = \frac{S_e}{\sqrt{SS_x}} = 0.08275$

$\text{C.I. for } \beta_1 = 1.72 \pm (1.895)(0.08275)$

$1.56 < \beta_1 < 1.88$

The planners can then base the formulation of national tax policy on the condition that they
can be 90 percent certain that the population regression coefficient is between 1.56 and
1.88.
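
Because only summary sums are given for this problem, the whole chain from sums to confidence interval can be checked in a short script. The Python sketch below is illustrative: the sums, n = 9, and the critical value 1.895 (7 degrees of freedom, 10 percent level, two-tailed) come from the text, and the variable names are assumptions.

```python
import math

# Summary sums for the productivity (X) and real GNP growth (Y) data.
n = 9
sum_x, sum_y = 32.5, 62.2
sum_xy, sum_x2, sum_y2 = 255.4, 135.25, 483.72

ss_x = sum_x2 - sum_x ** 2 / n
ss_y = sum_y2 - sum_y ** 2 / n
ss_xy = sum_xy - sum_x * sum_y / n

b1 = ss_xy / ss_x
b0 = sum_y / n - b1 * sum_x / n
r_squared = ss_xy ** 2 / (ss_x * ss_y)

s_e = math.sqrt((ss_y - ss_xy ** 2 / ss_x) / (n - 2))   # standard error of the estimate
s_b1 = s_e / math.sqrt(ss_x)                            # standard error of the slope

t_crit = 1.895                                          # value quoted in the text
ci_low = b1 - t_crit * s_b1
ci_high = b1 + t_crit * s_b1

print(f"Y-hat = {b0:.4f} + {b1:.4f} X,  r^2 = {r_squared:.5f}")
print(f"Se = {s_e:.3f},  Sb1 = {s_b1:.5f}")
print(f"90% C.I. for beta_1: {ci_low:.2f} to {ci_high:.2f}")
```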


CHAPTER CHECKLIST
After studying this chapter, as a test of your knowledge of the essential points, can you

___ Distinguish between the dependent and independent variables?
___ Explain what is meant by ‘minimizing the sum of the errors squared’?
___ Calculate the sums of the squares and cross-products?
___ Calculate and interpret the regression model (equation)?
___ Explain how the Y-values are normally distributed around their mean, using a graph in
    the explanation?
___ Discuss the assumptions of the OLS model?
___ Calculate and interpret the standard error of the estimate with the aid of a graph?
___ Calculate and interpret the coefficient of determination and the correlation coefficient,
    again using a graph?
___ Explain what is meant by the measures of goodness-of-fit?
___ Calculate and interpret confidence intervals for
    • the conditional mean value of Y?
    • the predictive interval?
    • $\beta_0$?
    • $\beta_1$?
___ Complete hypothesis tests for $\rho$ and $\beta_1$?
___ Conduct and interpret the analysis of variance for a regression model?


LIST OF SYMBOLS AND TERMS

$Y$            Generally the dependent variable in a regression statement
$X$            Generally the independent variable in a regression statement
$\hat{Y}$      The estimated value for Y based on our regression model
$S_e$          The standard error of the estimate, which measures the average amount by which
               the actual observations for the dependent variable vary from the regression line
$\mu_{y|x}$    The conditional mean, which is the mean of the population of all Y-values given
               some specific X-value
$S_Y$          The standard error of the conditional mean
$S_{y_i}$      The standard error of the forecast
$r^2$          The coefficient of determination
$r$            The correlation coefficient
$1 - r^2$      The coefficient of nondetermination
$\rho$         The population correlation coefficient
$S_r$          The standard error of the sampling distribution of the correlation coefficient
$S_{b_1}$      The standard error of the regression coefficient $b_1$


LIST OF FORMULAS

[13.3]   $\hat{Y} = b_0 + b_1 X$
         Formula for a straight line showing the intercept, $b_0$, and the slope, $b_1$.

[13.8]   $SS_x = \sum X^2 - \frac{(\sum X)^2}{n}$
         Sum of the squares for X.

[13.9]   $SS_y = \sum Y^2 - \frac{(\sum Y)^2}{n}$
         Sum of the squares for Y.

[13.10]  $SS_{xy} = \sum XY - \frac{(\sum X)(\sum Y)}{n}$
         Sum of the cross-product.

[13.11]  $b_1 = \frac{SS_{xy}}{SS_x}$
         The slope of the regression line measures the unit change in Y given a
         one-unit change in X.

[13.12]  $b_0 = \bar{Y} - b_1 \bar{X}$
         The intercept is the value for Y when X is set equal to zero.

[13.13]  $S_e = \sqrt{\frac{\sum (Y_i - \hat{Y}_i)^2}{n - 2}}$
         The standard error of the estimate is the measure of the dispersion of the
         Y-values around their mean.

[13.14]  $SSE = SS_y - \frac{(SS_{xy})^2}{SS_x}$
         Error sum of squares.

[13.15]  $MSE = \frac{SSE}{n - 2}$
         Mean square error.

[13.16]  $S_e = \sqrt{MSE}$
         The standard error of the estimate is the measure of the dispersion of the
         Y-values around their mean.

[13.22]  $r^2 = \frac{(SS_{xy})^2}{(SS_x)(SS_y)}$
         The coefficient of determination measures the portion of the change in Y
         explained by a change in X.

[13.24]  $S_Y = S_e \sqrt{\frac{1}{n} + \frac{(X_i - \bar{X})^2}{SS_x}}$
         Standard error of the conditional mean.

[13.25]  $\text{C.I. for } \mu_{y|x} = \hat{Y} \pm t S_Y$
         Confidence interval for the conditional mean of Y given some X-value.

[13.26]  $S_{y_i} = S_e \sqrt{1 + \frac{1}{n} + \frac{(X_i - \bar{X})^2}{SS_x}}$
         Standard error of the forecast.

[13.27]  $\text{C.I. for } Y_x = \hat{Y} \pm t S_{y_i}$
         Predictive interval for a single value of Y given some X-value.

[13.28]  $t = \frac{r}{S_r}$
         The t-value used to test the hypothesis about the population correlation
         coefficient.

[13.29]  $S_r = \sqrt{\frac{1 - r^2}{n - 2}}$
         The standard error of the sampling distribution of r measures the variation
         in r from one sample to the next.

[13.30]  $t = \frac{b_1}{S_{b_1}}$
         The t-value used to test the hypothesis about the population regression
         coefficient.

[13.31]  $S_{b_1} = \frac{S_e}{\sqrt{SS_x}}$
         The standard error of the regression coefficient measures the variation in
         the coefficient from one sample to the next.

[13.32]  $\text{C.I. for } \beta_1 = b_1 \pm t S_{b_1}$
         Confidence interval for the regression coefficient.

[13.33]  $SSR = \frac{(SS_{xy})^2}{SS_x}$
         The regression sum of squares is used for various regression calculations.


CHAPTER EXERCISES


You Make the Decision
1. The CEO for the Acme Trucking Company is concerned about recent trends in business
performance. Profits are falling, and the stockholders are calling for his resignation. He
comes to you for some answers regarding this predicament. You collect data on miles
driven by the firm’s trucks and resulting revenues the firm earned. The CEO asks if there
might be a relationship between these two important variables and to what degree miles
driven might explain revenues.

a. What do you do? How do you respond?

b. The CEO wants to know if your work will allow him to predict future revenues. How
could you use your statistical results to provide an estimate of revenues in the near
future?

c. Your results include the regression model

$\widehat{Rev} = 23.2 + 523.6MD$ (where Rev is earned revenues and MD is miles
driven by the firm’s trucks)

with a correlation coefficient of 0.78. How would you interpret these results?

d. Your CEO feels confident in your statistical study given the fact that, he concludes,
MD causes 78 percent of the change in Rev. How do you respond?

e. You tell the CEO that the regression line you have computed is a mean line and that it
is the line of best fit. Having no knowledge of statistical analysis, he asks you to
explain. What do you tell him?


Conceptual Questions
2. What is meant by ‘minimizing the sum of the errors squared’ in your model for the
trucking firm in Problem 1?
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3. In what way might autocorrelation and heteroscedasticity present a problem in your 
regression model? 

4. What is the difference between regression and correlation? 

5. Identify the dependent and independent variables in each case:

a. Time spent working on a term paper and the grade received.

b. Height of a son and height of his father.

c. A woman's age and the cost of her life insurance.

d. Price of a product and the number of units purchased by an individual.

e. Demand for a product and the number of consumers in the market.


Problems


6. The residents of a small town are worried about a rise in housing costs in the area. The
mayor thinks that home prices fluctuate with land values. Data on 10 recently sold homes
and the cost of the land on which they were built are seen here in thousands of dollars.
Identify the dependent and the independent variable. Construct and interpret the regression
model. On this basis, does it appear that the mayor is correct?


Land Values    Cost of the House
 7.0           67.0
 6.9           63.0
 5.5           60.0
 3.7           54.0
 5.9           58.0
 3.8           36.0
 8.9           76.0
 9.6           87.0
 9.9           89.0
10.0           92.0

7. Barney Barnacle rents fishing boats at Dingy’s Dock in Tampa, Florida. Concerned about
the number of boats he rents, Barney collects data on the daily temperature and the
number of rentals as seen here.

a. Which is the dependent variable?

b. Calculate and interpret the regression model. What does the regression coefficient tell
Barney about the effect of temperature on rentals?

c. What would happen to rentals if the temperature went up one more degree?


Temperature    Number of Boats Rented
70 12
72 14 
69 “ 
65 8 
79 2 
63 12 
55 7 
69 13 
75 12

8. The student government at the local university is trying to determine if the admission
price to the game room in the student center has an impact on the number of students who
use the facilities. The cost of admission and the number of students who enter the room
are recorded for 12 successive Friday nights and shown here. Construct and interpret the
regression model.

Price    Number of Tickets
$1.25     95
 1.50     83
 1.75     75
 2.00     72
 2.10     69
 1.00    101
 1.00     98
 1.50     85
 2.00     75
 2.50     65
 1.10     98
 1.50     86


9. As chairman of the Federal Reserve System, Alan Greenspan has the responsibility of
controlling the nation’s money supply. His actions impact directly on mortgage rates
people must pay to buy houses. In 1989, his staff was instructed to examine the effect of
mortgage rates on the number of houses sold. A regional center in Lexington, Kentucky,
gathering data for the study provided the information shown here. Housing units are in
hundreds.

Year Housing Units Sold Mortgage Rate 
1971 20 12.10 
1974 17 13.50 
1976 13 14.95 
1978 14 13.75 
1980 15 12.95 
1982 14 12.50 
1984 15 10.10 
1986 16 9.82 
1988 17 9.50 

a. Determine the dependent and independent variables.
b. Assuming a linear relationship exists between these two variables, construct the
regression model.
c. Interpret the constant and the coefficient.
d. What would be the level of units sold if the mortgage rate was 11.5 percent?
e. What would happen to the number of units if the rate increased by 2 percentage
points?

10. Compute and interpret the standard error of the estimate for the previous problem. 


SAS 
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11. Data for the consumption of beef products as reported by the agricultural division at 
Florida State University are shown for 10 Florida counties. Figures are for May 1988 and 
are on a per-capita basis, 


County      Consumption (pounds)    Price (per pound)
Dade        65                      3.19
Taylor      67                      2.99
Broward     64                      3.22
Leon        64                      3.34
Duval       69                      2.85
Alachua     70                      2.73
Dixie       65                      3.04
Okaloosa    65                      3.09
Manatee     63                      2.88
Orange      64                      2.91


The governor of Florida requests that state economists in Tallahassee estimate the linear
demand curve for beef products.

a. Which is the dependent variable? (Hint: A demand curve can be expressed with
either price (P) or quantity (Q) in the dependent role. However, when, as in this case,
the analysis seems to view the issue from the standpoint of the consumer rather than
the producer, it is customary to argue that Q is a function of P.)

b. Estimate the linear demand curve, using OLS.

c. Interpret the results.

d. What economic principle dictates that the regression coefficient should carry a
negative sign?

12. Does studying really pay off? To answer this question, a curious student in a statistics
class asked 10 students how many hours they studied for the most recent test and the grade
they received. The data are recorded here.

a. Based on the coefficient of the regression model (or more aptly, its sign), what do
you conclude?
b. If you study one more hour, what will happen to your grade according to the model?


Grade Hours
69 25
92 26
32 12
92 32
90 29
30 10
87 21
88 27
4 15
30 18

13. To reduce crimes, the president has budgeted more money to put more police on our
streets. What information does the regression model offer based on these data for the
number of police on patrol and the daily number of reported crimes? Use the formulas
that illustrate that the OLS model is indeed based on deviations from the mean by
calculating

\[ SS_x = \sum (X - \bar{X})^2 \qquad SS_y = \sum (Y - \bar{Y})^2 \]
\[ SS_{xy} = \sum (X - \bar{X})(Y - \bar{Y}) \]


Police Number of Reported Crimes
13 8
15 9
23 12
25 18
15 8
10 6
9 5
20 10
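
The deviation formulas named above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `ols_from_deviations` and the sample values are hypothetical (they are not data from any problem here), and the slope and intercept use the standard simple-regression results b1 = SSxy / SSx and b0 = Ybar - b1 * Xbar.

```python
def ols_from_deviations(xs, ys):
    """Fit Y = b0 + b1*X from sums of squared deviations about the means."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sums of squares and cross-products, all measured as deviations from the means.
    ss_x = sum((x - x_bar) ** 2 for x in xs)
    ss_y = sum((y - y_bar) ** 2 for y in ys)
    ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = ss_xy / ss_x          # slope
    b0 = y_bar - b1 * x_bar    # intercept
    return {"SSx": ss_x, "SSy": ss_y, "SSxy": ss_xy, "b0": b0, "b1": b1}


# Hypothetical data, chosen so the arithmetic is easy to check by hand.
fit = ols_from_deviations([1, 2, 3, 4], [3, 5, 7, 9])
print(fit)
```

Because every quantity is built from distances to the means, the fitted line always passes through the point (Xbar, Ybar).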


14. As a safety feature, Gulf Leisure limits the horsepower in the jet skis it rents to tourists in
Tampa, Florida. The intent is to prevent inexperienced drivers from going too fast. Do
these data suggest that controlling horsepower will accomplish this goal based on the
regression coefficient? Fully interpret the results.


Speed (mph) Horsepower
50 35
35 20
45 35
47 40
60 50
65 60
72 65
37 30


15. Aunt Bea wants to get more yield from her Big Boy tomato plants this summer by
increasing the number of times she uses fertilizer. Based on the data shown here, does the
coefficient for the regression model suggest this is possible? Use the formulas to
illustrate that the OLS model is indeed based on deviations from the mean by calculating

\[ SS_x = \sum (X - \bar{X})^2 \qquad SS_y = \sum (Y - \bar{Y})^2 \]
\[ SS_{xy} = \sum (X - \bar{X})(Y - \bar{Y}) \]


Use of Fertilizer   Yield (pounds)
4.00 12.00
9.00 20.00
5.00 15.00
8.00 17.00
2.00 7.00

16. Twelve school districts in the Chicago area were interested in whether the rising property
tax rates could be associated with the number of pupils in a classroom in the local schools.
Does this seem to be the case based on the data shown here?


Tax Assessment Rates Pupils per Class
1.20 32
1.20 36
1.10 25
1.30 20
1.10 39
1.20 42
1.30 25
1.30 21
1.20 35
1.40 16
1.40 39
1.30 a7


a. If it is thought that more pupils require higher taxes, which is the dependent variable?
Calculate and interpret the regression model. Do larger classes seem to be associated
with higher taxes?

b. Calculate and interpret the coefficient of determination and the correlation
coefficient. Does it seem this model is useful?

c. Calculate and interpret the standard error of the estimate.


17. Based on figures released by the Internal Revenue Service, a national group of citizens
has expressed concern that the budget for the IRS has not been used effectively. The IRS
argued that an increase in the number of taxpayers filing returns explains the budget
problems. Relevant data are provided here.


Tax Returns (in millions) IRS Budget (in billions of dollars)
116 $6.7
116 6.2
118 5.4
118 5.9
120 3.7
117 5.9
118 4.7
121 4.2


a. Construct the regression model. Does the IRS argument seem plausible?
b. Calculate and interpret the coefficient of determination. 
c. Calculate and interpret the standard error of the estimate. 


18. It was recently reported in Financial Weekly that E. F. Hutton was interested in the
relationship between a person's income and the amount of money they had invested in the
stock market. Fifty individuals were randomly selected on the presumption that an
individual's investments in the stock market are influenced by his or her income. Using
regression and correlation analysis, do these data suggest such a relationship? (Data are in
thousands of dollars.)

a. Identify the dependent and independent variables.
b. Calculate and interpret the results of the regression model.
c. Are the data cross-sectional or time-series?


ΣX = 9,385     ΣX² = 3,025,553
ΣY = 988.1     ΣY² = 32,224.51
ΣXY = 303,471.3
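
When a problem supplies only summary totals like these, the regression statistics can still be recovered. The sketch below assumes the usual computational identities (for example, SSx = ΣX² - (ΣX)²/n); the function name `ols_from_sums` and the check values are hypothetical and are not taken from any problem in this chapter.

```python
import math


def ols_from_sums(n, sum_x, sum_y, sum_x2, sum_y2, sum_xy):
    """Simple-regression statistics computed from summary totals only."""
    ss_x = sum_x2 - sum_x ** 2 / n
    ss_y = sum_y2 - sum_y ** 2 / n
    ss_xy = sum_xy - sum_x * sum_y / n
    b1 = ss_xy / ss_x                      # regression coefficient
    b0 = sum_y / n - b1 * sum_x / n        # intercept
    r = ss_xy / math.sqrt(ss_x * ss_y)     # correlation coefficient
    sse = ss_y - ss_xy ** 2 / ss_x         # unexplained variation
    se = math.sqrt(sse / (n - 2))          # standard error of the estimate
    return {"b0": b0, "b1": b1, "r": r, "r2": r * r, "Se": se}


# Hypothetical totals for X = 1, 2, 3, 4 and Y = 2, 4, 5, 8.
stats = ols_from_sums(n=4, sum_x=10, sum_y=19, sum_x2=30, sum_y2=109, sum_xy=57)
print(stats)
```

The same call works for any of the summary-data problems once n and the five totals are read off the problem statement.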


19. Yachting, a magazine for the nautically-minded, carried an article warning weekend
sailors that "gusty winds can cause steering variation by several degrees." To determine
this relationship, data were collected on wind speed and steering error.


Wind Speed Degrees Off Course
40 15
25 12
36 13
45 24
15 8
10 7
9 7
56 22
36 14
7 15


a. Compute the regression model and interpret the results.

b. Plot the scatter diagram and interpret it.

c. Plot the residuals. What generalization can you make?

d. Compute and interpret the coefficient of determination and the correlation
coefficient.

20. A principal theory in finance holds that as bond yields rise, investors take funds out of the
stock market, causing it to fall, and buy debt securities (bonds). Weekly data, which use
the federal funds rate as a proxy for bond yields, as reported by the Commerce Department for
the winter of 1988 are shown.


Week Dow Jones Federal Funds Rate (%)
1 2,050 6.8
2 2,010 6.95
3 1,983 7.3
4 2,038 7.5
5 1,995 (illegible in source)
6 1,955 7.7
7 1,878 8.3
8 1,802 8.7


a. Assuming the federal funds rate affects the stock market, identify the dependent
variable.
b. Do these data tend to corroborate that financial theory? In what manner and to what
extent would interest rates serve as a forecasting tool for the stock market?

21. Economists often argue that changes in the real GNP affect returns on mutual funds. Data
were collected and displayed as seen here.


Percent Change in Real GNP Mutual Funds Returns (%)
1.3 21.0
1.5 25.0
0.2 18.0
-1.1 7.0
1.9 25.0
2.1 21.0
2.6 31.0
2.4 29.0
3.1 33.0
2.7 32.0


a. What does the regression coefficient suggest? 
b. Does the coefficient of determination support this claim? 
c. Calculate and interpret the standard error of the estimate. 


22. A popular financial theory holds that there is a direct relationship between the risk of an
investment and the return it promises. A stock's risk is measured by its β-value. Shown
here are the returns and β-values for 12 fictitious stocks suggested by the investment firm
of Guess & Pickum. Do these data seem to support this financial theory of a direct
relationship?


Stock Return (%) β-Value
1 5.4 1.5
2 8.9 1.9
3 2.3 1.0
4 1.5 0.5
5 3.7 1.5
6 8.2 1.8
7 5.3 1.3
8 0.5 -0.5
9 1.3 0.5
10 5.9 1.8
11 6.8 1.9
12 7.2 1.9


Investors typically view return as a function of risk. Use an interpretation of both the
regression coefficient and the coefficient of correlation in your response.


23. Calculate and interpret the standard error of the estimate for Problem 22.


24. City officials in New Orleans have long argued that the revenue to business during Mardi
Gras can be predicted on the basis of the tons of trash swept up after each year's
celebration. Data taken from city records for the last 10 years are given here.


City Revenues (in millions of dollars)   Trash (in tons)
2.10000 21.00000 
3.50000 2.00000 
1.10000 a 
50000 1 

3.60000 25.00000 
2.10000 21.00000 
3.50000 31.00000 
3.40000 29.00000 
3.30000 33.00000 

32.00000 


a. Calculate the regression model. Does it seem a relationship exists?
b. Calculate and interpret the standard error of the estimate.
c. Calculate and interpret the coefficient of determination and the correlation
coefficient.


25. Using the data from the previous problem,
a. Test the hypothesis for the regression coefficient at the 1 percent level. What is your
conclusion?
b. Test the hypothesis for the correlation coefficient at the 1 percent level. What is your
conclusion?
26. A local landscaper wishes to build a regression model to explain the revenue received
from work on residential lawns using housing starts as the independent variable. Data for
90 weeks were collected and tabulated as seen here.


ΣX = 5,013      ΣX² = 313,857      ΣXY = 837,045
ΣY = $13,410    ΣY² = 2,244,294


a. Compute the regression model and interpret the results.
b. Calculate and interpret the standard error of the estimate.
c. Calculate and interpret the coefficient of determination and the correlation
coefficient.

27. Using the data from the previous problem,
a. What would you estimate revenue to be if housing starts equal 50?
b. What is your 99 percent confidence interval for the conditional mean of revenue if
starts equal 50? Interpret your answer. How does it differ from the previous answer?
c. What is your 99 percent interval estimate of the predictive interval? Interpret your
answer. How does it differ from the two previous answers?
28. Economic theory holds that as interest rates go down firms are able to invest more in
capital equipment. Monthly figures for the interest rate and levels of new capital
investment in billions of dollars are shown in the table.


Month Interest Rate Capital Investment
January 10.0 10
February 9.5 11
March 9.0 12
July 7.5 16
August 7.0 17
September 6.5 18
October 6.0 19
November 5.5 20
December 5.0 21


a. Calculate the regression model.
b. Plot the data and the regression line. Does the model support the theory that lower
interest rates are associated with higher levels of investment?
c. Calculate and plot the residuals. Does there appear to be any autocorrelation?
29. A recent issue of Marketing News claimed that the number of calls received by
telemarketers could be explained by television advertising time (measured here in minutes). To
test this assertion, 56 weeks of data were collected, summarized here.


ΣX = 372      ΣX² = 3,590      ΣXY = 7,565
ΣY = 1,471    ΣY² = 43,054


a. What does the regression model reveal about this relationship?
b. Calculate and interpret the standard error of the estimate.
c. What is the strength of the relationship?
30. Many news accounts have linked drug sales in our nation's cities with the level of
poverty. Annual reported sales based on police reports and the percentage of people living
under the poverty line were collected for 80 cities. The data summarize as


ΣX = 939.2    ΣX² = 11,182.24    ΣXY = 35,038.72
ΣY = 2,963    ΣY² = 112,779


a. What is your estimate of the number of drug sales in a city with a poverty rate of
13 percent?

b. What is your interval estimate of the drug sales in many cities that all have a poverty
rate of 13 percent? Set alpha at 5 percent.

c. What is your interval estimate of the number of drug sales in a city with a rate of
13 percent? Set alpha at 5 percent.

d. What is the difference among the three estimates above?
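
The contrast among a point estimate, an interval for the conditional mean, and a predictive interval for a single observation can be made concrete with a short sketch. It assumes the standard simple-regression interval formulas; the function name `interval_half_widths`, its arguments, and the sample numbers are hypothetical, and the critical t-value (n - 2 degrees of freedom) is supplied by the caller from a t-table rather than computed.

```python
import math


def interval_half_widths(n, x_bar, ss_x, se, x0, t_crit):
    """Half-widths of the intervals around the point estimate at X = x0.

    se is the standard error of the estimate; t_crit is the tabled t-value
    for the chosen confidence level with n - 2 degrees of freedom.
    """
    core = 1 / n + (x0 - x_bar) ** 2 / ss_x
    mean_hw = t_crit * se * math.sqrt(core)        # conditional mean of Y
    pred_hw = t_crit * se * math.sqrt(1 + core)    # one individual value of Y
    return mean_hw, pred_hw


# Hypothetical inputs: n = 4, Xbar = 2.5, SSx = 5, Se = 0.5916, 95 percent, d.f. = 2.
mean_hw, pred_hw = interval_half_widths(4, 2.5, 5.0, 0.5916, 2.5, 4.303)
print(round(mean_hw, 4), round(pred_hw, 4))
```

The extra 1 under the square root is the only difference between the two formulas, which is why the predictive interval is always the wider of the two.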

31. Thirty customers for Neiman-Marcus in Dallas were polled regarding the number of 
members in their immediate family and the dollar purchases per visit. The store wanted to 
know if family size might explain expenditures. 


Summation of family members (F) = 80
Summation of purchase (P) = $1,698
ΣFP = 3,464
ΣF² = 296
ΣP² = 6,366,916

a. Determine the dependent variable.

b. Calculate the regression model.

c. Interpret the parameter estimates.

d. What is the strength of the relationship?

32. In the effort to evaluate the educational efforts of our university system, the
Department of Education asked deans of MBA programs to rate on a scale of 10 to … the
quality of education received at several of the nation's top schools. The results were
reported in the Chronicle of Higher Education. Data were also collected on the average
beginning salary for graduates of each school. Fifteen of these observations are shown
here. Do the data suggest that a school's reputation has any relationship to the beginning
salaries of its graduates?


School Rating Salary (in 1,000's)
1 55
2 45 100
3 4 110
4 31 65
5 3 67.5
6 52 103
7 59 120
8 43 89
9 28 60
10 61 125
11 49 100
12 20 40
13 73 150
14 69 140
15 79 160


a. Assuming firms extend salary offers on the basis of a school's reputation, identify the
dependent variable.
b. Compute and interpret the regression equation.
c. Compute and interpret the coefficient of determination and standard error of the
estimate.
d. Compute a confidence interval for the mean salary of many graduates from MBA
schools with a rating of 50. Set α = 10 percent.
e. Compute a confidence interval for the salary of one graduate from a school with a
50 rating. Set α = 10 percent.
f. Why is your answer to Part (d) narrower than the answer you got in Part (e)?


33. In a recent speech before Congress, Bob Dole, minority leader in the Senate, stated that
the number of immigrants entering the United States could be explained by economic
conditions in their country of origin. Forty-four nations were examined to determine the
number of immigrants (I) and the prevailing economic situation as measured by an
economic index (EI). The results are shown here.


Σ(EI) = 2,421.1    Σ(EI)² = 138,754.11
Σ(EI)(I) = 186,187.1
ΣI = 3,476         Σ(I)² = 281,556


a. Compute and interpret the regression model.
b. Judging by the coefficient of determination, does it seem Dole is correct?
c. Test the hypothesis for the regression coefficient at the 5 percent level.
d. Test the hypothesis for the correlation coefficient at the 5 percent level.


34. Using the data from the previous problem:

a. What is your 95 percent interval estimate for the number of immigrants from many
countries with an X-value of 50? Interpret the answer.

b. What is your 95 percent interval estimate for the number of immigrants from Slobovia
if it has an X-value of 50? Interpret the answer.

35. One hundred students are examined to determine if their entrance exam scores are good
predictors of their GPAs. Results were recorded as


ΣX = 522    ΣXY = 17,325
ΣY = 326    ΣX² = 28,854
ΣY² = 10,781

Test scores ranged from 0 to 10, and the GPA is based on a 5-point system. 

a. The dean of a local state university wishes to estimate the mean GPA of many 
students who score a 6.5 on the entrance exam. She wants to be 99 percent confident 
of that interval. 

b. A father of one prospective student wishes to estimate the GPA for his son who 
scored a 6.5 on the exam. He also wants a 99 percent confidence interval. 

c. Why is there a difference between the answers you got in Parts (a) and (b)? 

36. Emergency service for certain rural areas of Ohio is often a problem, especially during the
winter months. The chief of the Danville Township Fire Department is concerned about
response time to emergency calls. He orders an investigation to determine if distance to
the call, measured in miles, can explain response time, measured in minutes. Based on 37
emergency runs, the following data were compiled.


ΣX = 234    ΣX² = 1,796
ΣY = 831    ΣY² = 20,037
ΣXY = 5,890


a. What is the average response time to a call eight miles from the fire station?

b. How dependable is that estimate, based on the extent of the dispersion of the data
points around the regression line?

37. Referring to Problem 36, at the 90 percent level of confidence what can you say about the
significance of the sample

a. regression coefficient?

b. correlation coefficient?

38. Referring to Problem 36, with 90 percent confidence, what time interval would you
predict for a call from Zeke Zipple, who lives 10 miles from the station?

39. In reference to Problem 36, with 95 percent confidence, what is the average time interval
that you would predict for several calls 10 miles from the station?

40. Using the data from Problem 36, the fire chief is interested in a 95 percent confidence
interval estimate of the population regression coefficient. Interpret your results for the
chief.




41. Manufacturers are always trying to improve the efficiency rating of their employees. A
recent report in the Journal of Midwest Marketing detailed the efforts of one company to
improve employee performance through additional on-the-job training. Forty employees
were given additional hours of OJT and the resulting net changes in their efficiency ratings
(as measured in efficiency points) were recorded. The results are


ΣX = 124.5    ΣX² = 475.75
ΣY = 352.8    ΣY² = 3,384
ΣXY = 1,246.4


a. Identify the dependent variable.

b. Provide a point estimate of the net change in efficiency for someone who spends
10 hours in OJT. Interpret your answer.

42. Referring to Problem 41, the manager of the company plans to provide all his employees
with 10 hours of additional OJT if he can be 90 percent sure that it would increase the
efficiency rating by at least 30 points. Based on these sample results, should he proceed
with his plans?

43. The manager in Problem 41 must prove to his boss that the relationship between OJT and
improved performance does exist. Despite these sample results, the boss is skeptical. The
manager points to the coefficient of correlation as proof of the training's benefit. However,
his boss feels that this is just an aberration of the sample and that the relationship is not
that strong in general. How might the manager support his defense of the OJT plan, based
on the value of the sample correlation coefficient, keeping in mind that the boss is
skeptical?

44. Professor Smith is worried that the amount of time he spends with his model trains is
detracting from his ability to grade term papers. Data for 23 days for the number of
minutes spent with his favorite hobby (T) and the number of papers graded (G) reveal the
following:

ΣT = 1,072    ΣT² = 51,234    ΣTG = 17,842
ΣG = 386      ΣG² = 6,566


a. Does the coefficient for the model support the professor's fear?

b. How strong is the relationship?

c. Construct and interpret the 99 percent interval for the conditional mean of Y if X
is 40.

d. Construct and interpret the 99 percent predictive interval if X = 40.

45. Using the information from the previous problem,

a. Test the hypothesis regarding the regression coefficient at the 5 percent level.

b. Test the hypothesis regarding the correlation coefficient at the 1 percent level.

46. Given these monthly data, what conclusion can you reach from a regression equation
regarding the ability of the unemployment rate to explain the number of people filing
claims for welfare?

Unemployment Rate Claimants
2.5 12
3.6 18
5.4 21
6.5 22
2.7 10
4.6 20
9.4 24
2.9 (illegible in source)
8.7 23
7.4 22
6.4 23
1.8 8


a. Construct and interpret the regression equation. 

b. Calculate and interpret the coefficient of determination. 

c. Calculate and plot the residuals. What generalization can you make? 

47. Using the data in the previous problem,

a. Construct and interpret the 99 percent interval for the average number of claimants if
unemployment is 9 percent for many months.

b. Construct and interpret the 99 percent interval for number of claimants if it is
9 percent one month.

48. Using the data for the previous problem, construct and interpret the analysis of variance
table. Set alpha at 1 percent.


49. Annie thinks that when other people in her home town have garage sales, this detracts
from the number of customers she can draw to her weekly sale. She collects data for 12
sales to see if a relationship exists.


Total Sales in Town   Annie's Customers


25 7 
32 92 
62 112 
18 79 
21 75 
54 110 
62 124 
54 120 
36 98 
24 7 
21 7 
31 74

a. Compute the model and interpret for Annie.
b. What does the correlation coefficient reveal? The coefficient of determination?
c. Calculate and interpret the standard error of the estimate.


50. Using the data from the previous problem,

a. Construct and interpret the 99 percent interval for the average number of customers if
the number of sales is 20 for many weeks.

b. Construct and interpret the 99 percent interval for number of customers if there are 20
sales this week.

51. Using the data from Annie's problem above, construct and interpret the analysis of
variance table.

52. The Harvard Business Review discussed the efforts by a trucking firm to reduce delivery
time by requiring employees to study city maps and learn the road system. Study time and
delivery time were both measured in hours. Fifteen employees were surveyed regarding
the time they studied the map and the elapsed time of their last delivery. The results are


ΣX = 36.4    ΣX² = 90.04
ΣY = 22.6    ΣY² = 41.78
ΣXY = 51.37


a. Is the dependent variable study time or delivery time? Explain.

b. Compute the regression model.

c. Compute and interpret the correlation coefficient.

53. Based on 5 percent significance tests from the previous problem of the sample correlation
and regression coefficients, what conclusion can you draw about the wisdom of continuing
to require employees to study city street maps?


Empirical Exercises

54. a. Identify any two variables you feel might be related that pertain to your fellow
students, for example, their GPA and the number of hours they normally spend
studying each week.
b. Collect these data for 10 to 20 students.
c. Compute the regression equation and r².
d. Interpret the explanatory power of your model.
e. Estimate your value for Y given some X-value.
f. Test the significance of the regression coefficient at the 1 percent level.

55. a. Cite two economic variables that economic theory tells us are related in a dependent
manner (e.g., money supply and inflation rate, consumption and income).
b. Identify which is the dependent variable and which is the independent variable.
c. Using the Survey of Current Business, The Federal Reserve Bulletin, or some other
source of economic data, collect 10 to 20 observation points.
d. Plot a scatter diagram. Does there seem to be a relationship?
e. Calculate the regression equation and r².
f. Interpret your results.
g. State clearly your conclusion regarding the validity of the economic theory you cited
in collecting your original data set.


Computer Exercises


Access the file OUTPUT in your data bank. It consists of 20 observations for the monthly 
output levels of General Electric for the number of refrigerators produced in their Norfolk, 
Virginia, plant over most of 1987 and 1988. The data set also includes values for the usage of 
labor inputs measured in worker-hours for the same time period. Using regression and 
correlation analysis, determine if the amount of labor used in the production process can 
explain the level of output for GE. How strong is the relationship between the two variables? 
What happens to output as the level of labor goes up? Identify and interpret the standard error 
of the estimate. 


1. John Wood works as a ranger for the Texas state police force. Lately he has noticed an 
alarming trend toward higher recidivism rates (repeat offenses) by criminals in several 
southwestern counties in Texas, Law enforcement agencies are concerned about the 
escalating crime rates, and there is a growing public outcry to stem the rising tide of social 
disorder. Ranger Wood has been charged with the responsibility of identifying a method of 
predicting crime rates in anticipation of corrective action to be taken by the Texas police 
force. He feels that he must therefore isolate a predictive variable that might explain crime 
patterns. After much consideration he finally settles on county expenditures for crime 
abatement, feeling that there might be a relationship between crime rates and the amount of 
money spent to control lawlessness. Ranger Wood therefore randomly selects 10 counties
in southwest Texas and obtains monthly data for the most recent month available on the 
number of serious crimes in each county and the amount of money spent by the county to 
combat crime. The results are 


County Expenditures (1,000’s) Crimes 


Maverick 10 4 
Crockett 12 4 
Pecos 9 a7 
El Paso 20 21 
Loving 15. : 5 
Jeff Davis 7 20 
Midland e | ` 29 
Coke g-i. 0 
Dawson 7 49 
Yoakum 12 4r 


Does it appear that Ranger Wood has uncovered a method of predicting crime rates? If
Pecos County budgets $16,000 next month for law enforcement protection, what is your
estimate of the number of criminal acts that might be perpetrated? The Texas legislature
would like an interval estimate of county crime rates if they spend $13,000 on crime
prevention in a given county. The legislators insist on 99 percent accuracy in that
estimation.


Multiple Regression and Correlation


A Preview of Things to Look For 

1. The two additional assumptions associated with multiple regression. 
2. The use of analysis of variance to evaluate the model. 

3. How t-tests can be used to test each coefficient. 

4. What the adjusted coefficient of determination measures. 

5. Problems, detection, and treatment of multicollinearity.


CHAPTER BLUEPRINT


Many regression models require the use of more than one independent variable 
to explain the values that the dependent variable may take. While the methods 
of the previous chapter allowed us to consider only one independent variable, 
this chapter incorporates other independent variables to aid in the explanation 


of the dependent variable. 




Statistics show that you can never go broke taking a profit.


INTRODUCTION

In the previous chapter we saw how a single independent variable could be used to
predict the value for a dependent variable. We closely explored this amazing world of
simple regression and correlation, and examined how to identify and measure the
statistical relationship between two variables. However, simple regression limits us to
only one independent variable. Consider how much more useful and explanatory our
model might become if we were allowed to use more independent variables! This is
precisely what multiple regression permits us to do. Multiple regression involves the
use of two or more independent variables. The simple regression model was
expressed as

\[ Y = b_0 + b_1 X + e \qquad [14.1] \]

The multiple regression model is

\[ Y = b_0 + b_1 X_1 + b_2 X_2 + \cdots + b_k X_k + e \qquad [14.2] \]

where k is the number of independent variables and \(b_i\) are the coefficients for these
variables. In both models, e is the random error component made necessary because
not all observations fall directly on the regression line. In this manner, multiple
regression is a logical extension of the simple linear model developed in Chapter 13.
Our principal objective is the same as with simple regression: We want to calculate \(b_i\)
as an estimate of the population parameters \(\beta_i\).


FORMULATION OF THE MODEL 


In the previous chapter, Hop Scotch Airlines developed a simple regression model to
help them predict the number of passengers they might expect and to assist in planning
day-to-day operations. Their model contained only one explanatory variable:
advertising. The regression equation was

\[ \hat{Y} = b_0 + b_1 X = 4.4 + 1.08X \]

The regression coefficient of 1.08 told Hop Scotch's management that for every 1-unit
($1,000) increase in advertising, the number of passengers will increase by 1.08 units
(1,080 passengers). By calculating the coefficient of determination of \(r^2 = 0.94\), they
found that their model explains 94 percent of the change in the number of passengers
who bravely fly with Hop Scotch.

Soon after developing the model, Hop Scotch hires Ace Rickenbacker as a
management consultant. Ace is to serve as a marketing and financial specialist in the
effort to increase Hop Scotch's revenue. His first task is to expand the regression
model, used to predict passengers, to include variables other than advertising. Ace
must therefore identify other variables that might explain changes in the number of
passengers. As possible candidates, Ace considers such variables as the prices of train
and bus tickets, customers' income, population, and a host of other logical
alternatives. To simplify our discussion, we assume that Ace begins by adding only one
variable to the model. However, the analysis can be extended to include any number
of independent variables. Ace recalls that a basic premise behind the theory of
demand, as preached by economists, states that income is a primary determinant of
consumers' demand. He therefore feels that by incorporating a measure of consumer
income, he can improve on Hop Scotch's regression model. Ace therefore settles on
national income as a second possible explanatory variable. (National income per
capita would likely be a much better variable. However, for the purpose of illustration
and variety in our variables, we will assume Ace chooses total national income as his
second variable.) His model therefore becomes

\[ Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + e \qquad [14.3] \]

where \(Y\) is the number of passengers measured in units of 1,000
\(X_1\) is Hop Scotch's advertising expenditures measured in units of $1,000
\(X_2\) is national income measured in units of trillions of dollars


The sample regression model is

\[ \hat{Y} = b_0 + b_1 X_1 + b_2 X_2 \qquad [14.4] \]

where the coefficients \(b_1\) and \(b_2\) are estimates of \(\beta_1\) and \(\beta_2\), respectively. Actually, \(b_1\)
and \(b_2\) are called partial (or net) regression coefficients, since we are working with a
multiple regression model, which contains more than one coefficient. However, the
term partial is often understood given the context of multiple regression, and the
expression is shortened to regression coefficient or just coefficient. The coefficients
are interpreted much as they were in simple regression. The value \(b_1\) is the amount by
which Y will change for every one-unit change in \(X_1\) if \(X_2\) is held constant. For every
one-unit increase in \(X_2\), Y will change by \(b_2\) units if \(X_1\) is held constant.
Multiple regression involves the same assumptions cited in the previous chapter
for simple regression, plus two others. The first assumption requires that the number
of observations, n, exceed the number of independent variables, k, by at least 2. In
multiple regression there are k + 1 parameters to be estimated: coefficients for the
k independent variables plus the intercept term. Therefore, the degrees of freedom
associated with the model are d.f. = n - (k + 1). If we are to retain even one degree
of freedom, n must exceed k by at least 2, so that n - (k + 1) is at least 1.
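
The first assumption amounts to a one-line check. The sketch below is only an illustration; the function name `degrees_of_freedom` is hypothetical and the sample values are not from the text.

```python
def degrees_of_freedom(n, k):
    """Return d.f. = n - (k + 1) for n observations and k independent variables."""
    df = n - (k + 1)
    if df < 1:
        # Fewer than k + 2 observations leaves no degrees of freedom.
        raise ValueError("n must exceed k by at least 2")
    return df


# Hypothetical case: 15 observations and 2 independent variables.
print(degrees_of_freedom(15, 2))
```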

The second assumption involves the relationship between the independent
variables. It requires that none of the independent variables be linearly related. For
example, if \(X_1 = X_2 + X_3\), or perhaps \(X_1 = 0.5X_2\), then a linear relationship would
exist between two or more independent variables and a serious problem would arise.
This problem is multicollinearity.


Multicollinearity: Multicollinearity exists if one of the independent variables is linearly
related to any of the others.


Multicollinearity may cause the algebraic signs of the coefficients to be the opposite
of what logic may dictate, while greatly increasing the standard error of the
coefficients. A more thorough discussion of multicollinearity follows later in this chapter.
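
One way to see why an exact linear relationship is so damaging: the least-squares coefficients come from solving a system of linear equations whose coefficient matrix is built from sums of the X's, and that matrix has a zero determinant when one X is an exact multiple of another, so no unique solution exists. The sketch below assumes two explanatory variables; the helper names and data are hypothetical.

```python
def normal_matrix(x1, x2):
    """Coefficient matrix of the least-squares equations for two X variables."""
    n = len(x1)
    s1, s2 = sum(x1), sum(x2)
    s11 = sum(a * a for a in x1)
    s22 = sum(b * b for b in x2)
    s12 = sum(a * b for a, b in zip(x1, x2))
    return [[n, s1, s2], [s1, s11, s12], [s2, s12, s22]]


def det3(m):
    """Determinant of a 3 x 3 matrix by cofactor expansion."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)


x2 = [2.0, 4.0, 6.0, 8.0]
x1_collinear = [0.5 * v for v in x2]      # X1 = 0.5 * X2, as in the example above
x1_unrelated = [1.0, 3.0, 2.0, 5.0]       # not an exact linear function of X2

print(det3(normal_matrix(x1_collinear, x2)))   # zero: coefficients cannot be solved for
print(det3(normal_matrix(x1_unrelated, x2)))   # nonzero: a unique solution exists
```

In practice the relationship is rarely exact, so the determinant is merely close to zero, which shows up as unstable coefficients with large standard errors.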


Ace’s Objective
Ace must now devise estimates for $\beta_1$ and $\beta_2$ by determining values for $b_1$ and $b_2$.
We know from our study of simple regression that the linear relationship between two
variables can be expressed by a straight line. But a mere line will not depict the
relationship when more than two variables are involved.

If three variables are involved, as in our case with Ace’s regression model, a
regression plane is used. (The presence of more than three variables requires a hyperplane.)


Figure 14-1 illustrates a regression plane for Hop Scotch’s model. The value for the
dependent variable is shown on the single vertical axis. The coefficients are the slopes
of the regression plane, and the intercept is shown by $\beta_0$.


Figure 14-1. A Regression Plane for Hop Scotch Airlines


The values for $b_1$ and $b_2$ in the sample regression model are found much as $b_1$
was found in simple regression. We want estimates of the coefficients in the equation
of a plane that will minimize the sum of the squared errors $\sum (Y_i - \hat{Y}_i)^2$. If we can
obtain these values for $b_1$ and $b_2$, we will have developed an ordinary least-squares
model providing the best fit for our data.


The Normal Equations


Computation of the sample coefficients is accomplished with the aid of the normal
equations. The derivation of these normal equations requires differential calculus and
is beyond the scope of this text. For those of you with knowledge of calculus, the
process involves taking the partial derivatives of the sum of the errors squared with
respect to each $b_i$ and setting them equal to zero. However, for the rest of us, we will just
happily accept their existence and go on from there.

For a model with two explanatory variables, the normal equations are 


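$$\sum Y = n b_0 + b_1 \sum X_1 + b_2 \sum X_2 \qquad [14.5]$$
$$\sum X_1 Y = b_0 \sum X_1 + b_1 \sum X_1^2 + b_2 \sum X_1 X_2 \qquad [14.6]$$
$$\sum X_2 Y = b_0 \sum X_2 + b_1 \sum X_1 X_2 + b_2 \sum X_2^2 \qquad [14.7]$$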


The solution of this system of equations requires matrix algebra and a good deal of 
time-consuming third-grade arithmetic. Therefore, the computations are generally 
done on a computer, and, for the most part, we follow this practice. However, in each 
instance, the hand calculations will be demonstrated in order to provide a full 
understanding of exactly what the statistic is measuring and how it can be interpreted. 
It is not enough to merely read the computer output. You must understand what 
calculations are necessary to obtain the statistics. This can be achieved only by 
examining the mechanics of the equations actually used to compute the statistics. The 
solution for the regression equation for Hop Scotch’s model is shown in the chapter
appendix.
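For readers who want to see the normal equations at work, here is a minimal sketch in Python. It builds the sums that appear in Formulas (14.5) through (14.7) from the Hop Scotch data (as listed in Tables 14-2 and 14-5) and solves the three equations by Gaussian elimination. The function name `solve_linear_system` and the variable names are illustrative choices, not part of the text, and the elimination routine stands in for the matrix algebra the text mentions.

```python
# Monthly data for Hop Scotch Airlines, as listed in Tables 14-2 and 14-5.
passengers = [15, 17, 13, 23, 16, 21, 14, 20, 24, 17, 16, 18, 23, 15, 16]   # Y, in 1,000s
advertising = [10, 12, 8, 17, 10, 15, 10, 14, 19, 10, 11, 13, 16, 10, 12]   # X1, in $1,000s
national_income = [2.40, 2.72, 2.08, 3.68, 2.56, 3.36, 2.24, 3.20,
                   3.84, 2.72, 2.07, 2.33, 2.98, 1.94, 2.17]                # X2


def solve_linear_system(matrix, rhs):
    """Solve a small square system by Gaussian elimination with partial pivoting."""
    size = len(rhs)
    aug = [list(row) + [value] for row, value in zip(matrix, rhs)]
    for col in range(size):
        # Swap in the row with the largest entry in this column.
        pivot = max(range(col, size), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, size):
            factor = aug[row][col] / aug[col][col]
            for j in range(col, size + 1):
                aug[row][j] -= factor * aug[col][j]
    # Back substitution.
    solution = [0.0] * size
    for i in reversed(range(size)):
        known = sum(aug[i][j] * solution[j] for j in range(i + 1, size))
        solution[i] = (aug[i][size] - known) / aug[i][i]
    return solution


n = len(passengers)
sum_y = sum(passengers)
sum_x1 = sum(advertising)
sum_x2 = sum(national_income)
sum_x1y = sum(x1 * y for x1, y in zip(advertising, passengers))
sum_x2y = sum(x2 * y for x2, y in zip(national_income, passengers))
sum_x1x1 = sum(x1 * x1 for x1 in advertising)
sum_x2x2 = sum(x2 * x2 for x2 in national_income)
sum_x1x2 = sum(x1 * x2 for x1, x2 in zip(advertising, national_income))

# Rows correspond to Formulas (14.5), (14.6), and (14.7); unknowns are b0, b1, b2.
coefficient_matrix = [
    [n, sum_x1, sum_x2],
    [sum_x1, sum_x1x1, sum_x1x2],
    [sum_x2, sum_x1x2, sum_x2x2],
]
right_hand_side = [sum_y, sum_x1y, sum_x2y]

b0, b1, b2 = solve_linear_system(coefficient_matrix, right_hand_side)
print(f"b0 = {b0:.3f}, b1 = {b1:.3f}, b2 = {b2:.3f}")
```

The three values should agree, up to rounding, with the coefficients the chapter reports for Hop Scotch's model.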


Ace’s Solution


Since Ace has chosen national income (NI) as his second explanatory variable, he
must now obtain the proper data. Recall that the original data set contained 15 
observations for monthly values of (1) numbers of passengers and (2) advertising 
expenditures. From the Federal Reserve Bulletin or similar data source, Ace collects 
the levels of national income for those same 15 months. The complete data set would 
then appear as in Table 14-1. 

With these data, Ace is now ready to compute his expanded regression model and
to determine if it is an improvement over the simple model. This is the subject of the 
rest of this chapter. 

Using a computer, Ace derives the regression model illustrated by Formula (14.8). 


$$\hat{Y} = b_0 + b_1 X_1 + b_2 X_2 = 3.53 + 0.84\,\text{Adv} + 1.44\,\text{NI} \qquad [14.8]$$
Table 14-1. Multiple Regression Data for Hop Scotch Airlines

Observation    Passengers (Y)    Advertising (X1)    National Income (X2)
(months)       (in 1,000's)      (in $1,000's)       (in trillions of $)
  1              15                10                  2.40
  2              17                12                  2.72
  3              13                 8                  2.08
  4              23                17                  3.68
  5              16                10                  2.56
  6              21                15                  3.36
  7              14                10                  2.24
  8              20                14                  3.20
  9              24                19                  3.84
 10              17                10                  2.72
 11              16                11                  2.07
 12              18                13                  2.33
 13              23                16                  2.98
 14              15                10                  1.94
 15              16                12                  2.17


Formula (14.8) is a regression plane and represents the relationship among the three
variables. The printout from an SPSS-PC computer run using the data Ace has
collected is shown in Display 14-1. You may wish to examine the hand calculations
in the appendix at the end of this chapter to gain some perspective into multiple
regression.


DISPLAY 14-1. A Regression Run for Hop Scotch Airlines

Equation Number 1    Dependent Variable..  PASS

Variable            B        SE B       Beta        T     Sig T
NI            1.44097      .73604     .24880    1.958     .0739
ADV            .83966      .14191     .75197    5.917     .0001
(Constant)    3.52840      .99942               3.530     .0041
End Block Number 1   All requested variables entered.


Given the interpretation of the partial regression coefficients noted earlier, Ace
can see that if advertising is increased by 1 unit and national income is held constant,
the number of passengers increases by 0.84 units. Since both variables were expressed
in units of 1,000, this means that if Hop Scotch spends $1,000 more (less) on
advertising, assuming national income does not change, the number of passengers will
increase (decrease) by 840. Furthermore, if national income goes up (down) by 1 unit
($1 trillion) and advertising is held constant, passengers will increase (decrease) by
1.44 units, or 1,440.
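The "held constant" reading of the partial coefficients can be illustrated with a small Python sketch. It uses the rounded coefficients from Formula (14.8); the function name `predicted_passengers` and the starting values (advertising of 10, national income of 2.40) are hypothetical choices made only for the illustration.

```python
# Rounded coefficients from Formula (14.8).
B0, B1, B2 = 3.53, 0.84, 1.44


def predicted_passengers(adv, ni):
    """Predicted passengers (in 1,000s) for advertising (in $1,000s) and national income."""
    return B0 + B1 * adv + B2 * ni


base = predicted_passengers(10, 2.40)
more_advertising = predicted_passengers(11, 2.40)   # advertising up 1 unit, NI held constant
higher_income = predicted_passengers(10, 3.40)      # NI up 1 unit, advertising held constant

# Convert from thousands of passengers to passengers.
adv_effect = (more_advertising - base) * 1000
ni_effect = (higher_income - base) * 1000
print(round(adv_effect), round(ni_effect))
```

Changing one input while freezing the other isolates that variable's coefficient, which is exactly the interpretation given above.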


EVALUATING THE MODEL


Now that Ace has his model, he must determine whether it represents an improvement 
over the simple regression model. Several tests can be used to evaluate a multiple 
regression model. In this section we (1) calculate and interpret the standard error of the 
estimate, (2) evaluate the entire model using ANOVA and the F-test, and (3) evaluate 
the contribution of each independent variable with the use of t-tests. In another 
section, we examine the coefficient of multiple determination as another method of 
evaluating the model. 


The Standard Error of the Estimate 


The interpretation of the standard error of the estimate is much the same as it was with
the simple regression model. It measures the dispersion of the actual values of Y
around those predicted by the model, $\hat{Y}$. It is a measure of the average amount by
which the actual observations vary around the regression plane. The standard error of
the estimate, $S_e$, is found much as it was in the case of simple regression. The mean
square error (MSE) is found by dividing the sum of the squared errors (SSE) by the
degrees of freedom.

Since $SSE = \sum (Y_i - \hat{Y}_i)^2$, we have

$$MSE = \frac{\sum (Y_i - \hat{Y}_i)^2}{n - k - 1} \qquad [14.9]$$

Then

$$S_e = \sqrt{MSE} \qquad [14.10]$$


This formula requires that the predicted value of Y ($\hat{Y}_i$) be calculated for every
observation. The error, the difference between this predicted value and the actual
Y-value ($Y_i$), is then squared and summed for all observations. Obviously, such tedious
calculations are seldom done by hand. The complete process is demonstrated in the
chapter appendix to illustrate what the standard error is and what it measures. If hand
calculation is necessary, Formula (14.11) provides an estimate of the standard error.

$$S_e = \sqrt{\frac{\sum Y^2 - b_0 \sum Y - b_1 \sum X_1 Y - b_2 \sum X_2 Y - \cdots - b_k \sum X_k Y}{n - k - 1}} \qquad [14.11]$$


Table 14-2 provides much of the computation necessary for Formula (14.11), which
yields

$$S_e = \sqrt{\frac{4{,}960 - (3.53)(268) - (0.84)(3{,}490) - (1.44)(746.62)}{15 - 2 - 1}} = 0.78$$
Table 14-2. An Alternative Method for Calculating Se

Pass (Y)    Adv (X1)    NI (X2)      Y²       X1Y       X2Y
   15          10         2.40       225       150      36.00
   17          12         2.72       289       204      46.24
   13           8         2.08       169       104      27.04
   23          17         3.68       529       391      84.64
   16          10         2.56       256       160      40.96
   21          15         3.36       441       315      70.56
   14          10         2.24       196       140      31.36
   20          14         3.20       400       280      64.00
   24          19         3.84       576       456      92.16
   17          10         2.72       289       170      46.24
   16          11         2.07       256       176      33.12
   18          13         2.33       324       234      41.94
   23          16         2.98       529       368      68.54
   15          10         1.94       225       150      29.10
   16          12         2.17       256       192      34.72
  268         187        40.29     4,960     3,490     746.62


The work with Formula (14.11) and that found in the appendix provide what is
necessary to understand what the standard error of the estimate is and what it
measures. Fortunately, most computer programs are designed to report this important
statistic. Display 14-2 is the printout for the SPSS-PC run. The standard error of the
estimate is seen to be 0.82167. The hand calculation above gave 0.78 only because the
coefficients were rounded to two decimal places before being entered into Formula (14.11).
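The two routes to the standard error of the estimate can be compared in a short Python sketch. The first route sums the squared residuals as in Formula (14.10), using the coefficients printed in Display 14-1; the second applies the shortcut Formula (14.11) with coefficients rounded to two decimals, as in the hand calculation. The helper name `shortcut_se` is an illustrative choice, not something taken from the text.

```python
import math

passengers = [15, 17, 13, 23, 16, 21, 14, 20, 24, 17, 16, 18, 23, 15, 16]   # Y
advertising = [10, 12, 8, 17, 10, 15, 10, 14, 19, 10, 11, 13, 16, 10, 12]   # X1
national_income = [2.40, 2.72, 2.08, 3.68, 2.56, 3.36, 2.24, 3.20,
                   3.84, 2.72, 2.07, 2.33, 2.98, 1.94, 2.17]                # X2

b0, b1, b2 = 3.52840, 0.83966, 1.44097   # coefficients reported in Display 14-1
n, k = len(passengers), 2

# Formula (14.10): square root of SSE / (n - k - 1), with SSE built from residuals.
predicted = [b0 + b1 * x1 + b2 * x2 for x1, x2 in zip(advertising, national_income)]
sse = sum((y - y_hat) ** 2 for y, y_hat in zip(passengers, predicted))
se_from_residuals = math.sqrt(sse / (n - k - 1))


def shortcut_se(c0, c1, c2):
    """Formula (14.11): standard error from column totals and the coefficients."""
    sum_y = sum(passengers)
    sum_yy = sum(y * y for y in passengers)
    sum_x1y = sum(x1 * y for x1, y in zip(advertising, passengers))
    sum_x2y = sum(x2 * y for x2, y in zip(national_income, passengers))
    numerator = sum_yy - c0 * sum_y - c1 * sum_x1y - c2 * sum_x2y
    return math.sqrt(numerator / (n - k - 1))


# Same rounded coefficients as the hand calculation in the text.
se_shortcut_rounded = shortcut_se(3.53, 0.84, 1.44)

print(f"from residuals: {se_from_residuals:.4f}")
print(f"shortcut with rounded coefficients: {se_shortcut_rounded:.2f}")
```

The shortcut subtracts several large, nearly equal totals, so it is sensitive to how many decimals the coefficients carry; the residual-based route is far less affected by rounding.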


DISPLAY 14-2. The Standard Error of the Estimate for Hop Scotch

Equation Number 1    Dependent Variable..  PASS

Variable(s) Entered on Step Number
   1..    NI
   2..    ADV

Multiple R            .97613
R Square              .95282
Adjusted R Square     .94496
Standard Error        .82167


Remember, the standard error of the estimate measures the dispersion of the
actual, observed Y-values ($Y_i$) around the regression plane. This is illustrated in
Figure 14-2. The actual $Y_i$ values are dispersed about the regression plane. The
standard error of the estimate measures the degree of this dispersion. Of course, the
less the dispersion, the smaller the $S_e$ and the more accurate the model is in predicting
and forecasting.


Figure 14-2. The Regression Plane for Hop Scotch


Evaluating the Model as a Whole


Given his regression model, one of the first questions Ace must ask himself is, “Does
it have any explanatory value?” This can perhaps best be answered by performing
analysis of variance (ANOVA). The ANOVA procedure will test whether any of the
independent variables has a relationship with the dependent variable. If an independent
variable is not related to the Y-variable, its coefficient should be zero. That is, if
$X_i$ is not related to Y, then $\beta_i = 0$. The ANOVA procedure tests the null hypothesis
that all the β-values are zero against the alternative that at least one β is not zero.
That is,

$$H_0\colon \beta_1 = \beta_2 = \beta_3 = \cdots = \beta_k = 0$$
$$H_A\colon \text{At least one } \beta \text{ is not zero}$$

If the null is not rejected, then there is no linear relationship between Y and any of
the independent variables. On the other hand, if the null is rejected, then at least one
independent variable is linearly related to Y.

The ANOVA process necessary to test the hypothesis was presented in Chapter 12.
An ANOVA table is set up, and the F-test is used to make the determination.
Table 14-3 provides the general format for an ANOVA table for multiple regression.


Table 14-3. A Generalized ANOVA Table

                                 Sum of     Degrees of    Mean
Source of Variation              Squares    Freedom       Square                 F-Value
Between samples (treatment)      SSR        k             MSR = SSR/k            F = MSR/MSE
Within samples (error)           SSE        n − k − 1     MSE = SSE/(n − k − 1)
Total variation                  SST        n − 1
Note its similarity to ANOVA tables you have already seen. Notice that the degrees of
freedom for the regression sum of squares is equal to k, the number of independent
variables in the model, while the degrees of freedom for the error sum of squares is
n − k − 1. Each of the sums of squares is found exactly as it was for simple
regression.

$$SST = \sum (Y_i - \bar{Y})^2 \qquad [14.12]$$
$$SSR = \sum (\hat{Y}_i - \bar{Y})^2 \qquad [14.13]$$
$$SSE = \sum (Y_i - \hat{Y}_i)^2 \qquad [14.14]$$


Table 14-4 provides the results in an ANOVA table for Hop Scotch Airlines.
Display 14-3 is the ANOVA table reported by SPSS-PC. Ace can then use this
information to test his hypothesis.


Table 14-4. ANOVA Table for Hop Scotch

                                 Sum of      Degrees of    Mean
Source of Variation              Squares     Freedom       Square     F-Value
Between samples (treatment)      163.632      2            81.816     121.18
Within samples (error)             8.102     12             0.675
Total variation                  171.733     14


DISPLAY 14-3. The ANOVA Table for Hop Scotch

Analysis of Variance
              Degrees of
              Freedom      Sum of Squares    Mean Square
Regression     2             163.63171         81.81585
Residual      12               8.10162           .67514
F = 121.18438      Signif F = .0000


To determine if the model has any explanatory power, Ace must test the
hypothesis

$$H_0\colon \beta_1 = \beta_2 = 0$$
$$H_A\colon \text{At least one } \beta \text{ is not zero}$$

Since the F-ratio is MSR/MSE, the degrees of freedom needed to perform an F-test, as
seen from Table 14-4, are 2 and 12. If Ace wants to test his hypothesis at, say, the
5 percent level, he finds from Table G that $F_{0.05,2,12}$ is 3.89. The decision rule for
Ace's hypothesis is: do not reject if F < 3.89; reject if F > 3.89. This is displayed in
Figure 14-3. Ace can plainly see that F = 121.18 > 3.89. He will therefore reject the
null hypothesis that $\beta_1 = \beta_2 = 0$. He can conclude with 95 percent confidence that a
linear relationship exists between Y and at least one of the independent variables.


Figure 14-3. F-test for Ace and His Regression Model (critical value 3.89; computed F = 121.18)
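The entries of the ANOVA table can be rebuilt from the raw data with a short Python sketch. It takes the coefficients printed in Display 14-1 as given, computes the three sums of squares from Formulas (14.12) through (14.14), and compares the F-ratio with the critical value of 3.89 quoted from Table G. The variable names are illustrative, and the critical value is hard-coded from the text rather than computed.

```python
passengers = [15, 17, 13, 23, 16, 21, 14, 20, 24, 17, 16, 18, 23, 15, 16]   # Y
advertising = [10, 12, 8, 17, 10, 15, 10, 14, 19, 10, 11, 13, 16, 10, 12]   # X1
national_income = [2.40, 2.72, 2.08, 3.68, 2.56, 3.36, 2.24, 3.20,
                   3.84, 2.72, 2.07, 2.33, 2.98, 1.94, 2.17]                # X2

b0, b1, b2 = 3.52840, 0.83966, 1.44097   # coefficients reported in Display 14-1
n, k = len(passengers), 2

mean_y = sum(passengers) / n
predicted = [b0 + b1 * x1 + b2 * x2 for x1, x2 in zip(advertising, national_income)]

sst = sum((y - mean_y) ** 2 for y in passengers)                       # Formula (14.12)
ssr = sum((y_hat - mean_y) ** 2 for y_hat in predicted)                # Formula (14.13)
sse = sum((y - y_hat) ** 2 for y, y_hat in zip(passengers, predicted)) # Formula (14.14)

msr = ssr / k                # mean square for regression
mse = sse / (n - k - 1)      # mean square error
f_ratio = msr / mse

F_CRITICAL_05 = 3.89         # 5 percent critical value for 2 and 12 d.f., as quoted in the text
reject_null = f_ratio > F_CRITICAL_05

print(f"SSR = {ssr:.3f}, SSE = {sse:.3f}, SST = {sst:.3f}")
print(f"F = {f_ratio:.2f}, reject H0: {reject_null}")
```

Because the coefficients are entered to five decimals, SSR and SSE add up to SST only approximately; the small gap is rounding, not a flaw in the decomposition.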


Testing Individual Partial Regression Coefficients


Ace has learned that at least one of the two independent variables has some relation- 
ship to the number of Hop Scotch passengers. The next logical step is to test each 
coefficient individually to determine which one (or ones) is significant. 

Again notice the great similarity with testing the slope coefficient under simple 
regression. The procedure uses a t-distribution, since n < 30, and tests the hypothesis 


$$H_0\colon \beta_i = 0$$
$$H_A\colon \beta_i \neq 0$$


The t-test statistic is

$$t = \frac{b_i}{S_{b_i}} \qquad [14.15]$$

where $b_i$ is the individual coefficient being tested
      $S_{b_i}$ is the standard error of $b_i$


$S_{b_i}$ is used because if another sample of n = 15 were taken, different coefficients would
result due to sampling error. That is, the coefficients would vary because the randomly
selected observations in the second sample would not be the same as they were in the
first sample. $S_{b_i}$ is used to capture that variation. Like most statistics associated with
multiple regression, $S_{b_i}$ is difficult to calculate by hand. If there are only two
independent variables,

$$S_{b_1} = \frac{S_e}{\sqrt{\sum (X_1 - \bar{X}_1)^2 \, (1 - r_{12}^2)}} \qquad [14.16]$$


where $r_{12}^2$ is the squared correlation coefficient for the two independent variables. This
correlation coefficient for $X_1$ and $X_2$ is conceptually the same as the correlation
coefficient for X and Y that we calculated in Chapter 13 for simple regression. Here,
however, instead of measuring the correlation between the dependent and the independent
variables, we are measuring the correlation between two independent variables. It
is calculated as

$$r_{12} = \frac{\sum X_1 X_2 - (\sum X_1)(\sum X_2)/n}{\sqrt{\left[\sum X_1^2 - (\sum X_1)^2/n\right]\left[\sum X_2^2 - (\sum X_2)^2/n\right]}} \qquad [14.17]$$

Table 14-5 aids in the calculations. Then

$$r_{12} = \frac{525.38 - (187)(40.29)/15}{\sqrt{\left[2{,}469 - (187)^2/15\right]\left[113.34 - (40.29)^2/15\right]}} = 0.8698$$

SPSS-PC can be used to request a correlation matrix. The results, shown in
Display 14-4, reveal the correlation between NI and ADV to be 0.8698.
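The arithmetic in Formula (14.17) can be checked with a few lines of Python. This sketch recomputes the column totals from the advertising and national income series and then applies the formula; the variable names are illustrative.

```python
import math

advertising = [10, 12, 8, 17, 10, 15, 10, 14, 19, 10, 11, 13, 16, 10, 12]   # X1
national_income = [2.40, 2.72, 2.08, 3.68, 2.56, 3.36, 2.24, 3.20,
                   3.84, 2.72, 2.07, 2.33, 2.98, 1.94, 2.17]                # X2

n = len(advertising)
sum_x1 = sum(advertising)
sum_x2 = sum(national_income)
sum_x1x2 = sum(x1 * x2 for x1, x2 in zip(advertising, national_income))
sum_x1x1 = sum(x1 * x1 for x1 in advertising)
sum_x2x2 = sum(x2 * x2 for x2 in national_income)

# Formula (14.17): correlation between the two independent variables.
numerator = sum_x1x2 - sum_x1 * sum_x2 / n
denominator = math.sqrt((sum_x1x1 - sum_x1 ** 2 / n) * (sum_x2x2 - sum_x2 ** 2 / n))
r12 = numerator / denominator

print(f"r12 = {r12:.4f}")
```

The totals produced along the way (187, 40.29, 525.38, and 2,469) are the same ones that appear in the hand calculation.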


Table 14-5. Computations for the Correlation between Advertising (X1) and National Income (X2): r12

Adv (X1)    NI (X2)    (Adv)(NI) X1X2    (Adv)² X1²
   10         2.40         24.00             100
   12         2.72         32.64             144
    8         2.08         16.64              64
   17         3.68         62.56             289
   10         2.56         25.60             100
   15         3.36         50.40             225
   10         2.24         22.40             100
   14         3.20         44.80             196
   19         3.84         72.96             361
   10         2.72         27.20             100
   11         2.07         22.77             121
   13         2.33         30.29             169
   16         2.98         47.68             256
   10         1.94         19.40             100
   12         2.17         26.04             144
  187        40.29        525.38           2,469
DISPLAY 14-4. A Correlation Matrix for Hop Scotch

Correlations:      PASS        ADV         NI
PASS             1.0000     .9684**     .9029**
ADV               .9684**  1.0000       .8698**
NI                .9029**   .8698**    1.0000
N of cases: 15     1-tailed Signif: * - .01   ** - .001
" . " is printed if a coefficient cannot be computed
Ace can now test the significance of $b_1$, the coefficient for advertising. The
hypothesis he is testing is whether advertising contributes any explanatory power to
the model designed to explain Hop Scotch’s passengers. The hypothesis is stated as

$$H_0\colon \beta_1 = 0$$
$$H_A\colon \beta_1 \neq 0$$

If the null is not rejected, it can be concluded that no linear relationship exists between
advertising and number of passengers. This would mean that advertising offers
nothing in the way of explaining the number of passengers. Ace must first calculate
$S_{b_1}$. Again, a presentation as in Table 14-6 is helpful in obtaining $S_{b_1}$. Recall that the
mean value for advertising expenditures is $\bar{X}_1 = 12.467$.

$$S_{b_1} = \frac{S_e}{\sqrt{\sum (X_1 - \bar{X}_1)^2 \, (1 - r_{12}^2)}} = \frac{0.8217}{\sqrt{[137.73][1 - (0.8698)^2]}} = 0.14191$$

Then

$$t = \frac{0.83966}{0.14191} = 5.917$$

The t-test for the hypothesis is shown in Figure 14-4.


Table 14-6. Computations for Standard Error of the First Regression Coefficient b1

Adv (X1)    X1 − X̄1    (X1 − X̄1)²
   10        −2.467       6.0861
   12        −0.467       0.2181
    8        −4.467      19.9541
   17         4.533      20.5481
   10        −2.467       6.0861
   15         2.533       6.4161
   10        −2.467       6.0861
   14         1.533       2.3501
   19         6.533      42.6801
   10        −2.467       6.0861
   11        −1.467       2.1521
   13         0.533       0.2841
   16         3.533      12.4821
   10        −2.467       6.0861
   12        −0.467       0.2181

                        137.7335 = Σ(X1 − X̄1)²
Figure 14-4. The t-test for the advertising coefficient (critical values −2.179 and 2.179; test statistic 5.917)


Assume Ace wishes to test at the α = 0.05 level of significance. Recall from the
ANOVA table that the number of degrees of freedom for the test is n − k − 1 =
15 − 2 − 1 = 12. The critical t-values taken from Table F are $t_{0.05,12} = \pm 2.179$. It is a
two-tailed test because the t-value may be significantly large or significantly small.
The decision rule is


Decision Rule  Do not reject the null if −2.179 < t < 2.179. Reject if t <
−2.179 or t > 2.179.


The value of the test statistic of 5.917 calculated from the sample data is clearly in the
upper rejection region. Ace can be 95 percent confident that the null of $\beta_1 = 0$ should
be rejected. Advertising does serve as an explanatory factor for Hop Scotch’s
passenger list.

Remember that the ability to merely read numbers on the computer printout is not
sufficient to obtain a true understanding of multiple regression. Although you may
never have to perform these elaborate computations by hand, you must nevertheless
acquire an intuitive understanding of the nature of regression analysis. This can be
done only by examining these formulas and their manipulations. Display 14-5 (which
is the same as Display 14-1) provides the SPSS-PC printout showing the standard
error of the regression coefficient.


DISPLAY 14-5. Ace's Regression Model

Equation Number 1    Dependent Variable..  PASS
----------- Variables in the Equation -----------
Variable            B        SE B       Beta        T     Sig T
NI            1.44097      .73604     .24880    1.958     .0739
ADV            .83966      .14191     .75197    5.917     .0001
(Constant)    3.52840      .99942               3.530     .0041

End Block Number 1   All requested variables entered.


Notice also the SIG T value of 0.0001 serves as the p-value for the test. Recall that
the p-value is the lowest level of significance at which the null can be rejected.
According to the printout, advertising is significant, and the null should be rejected at
any level of significance above 0.0001 (or 0.01%). Thus, for example, the use of any
of the customary levels of significance, of 1, 5, or 10 percent, would result in rejection
of the null, and the conclusion that advertising has a significant role to play in the
explanation of passengers.
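The SE B and T columns of the printout can be reproduced with a short Python sketch built on Formula (14.16). The standard error of the estimate and the coefficients are taken from Displays 14-2 and 14-1. Applying the same formula to national income, with the roles of the two variables swapped, is an assumption made here by symmetry; the helper names are illustrative.

```python
import math

advertising = [10, 12, 8, 17, 10, 15, 10, 14, 19, 10, 11, 13, 16, 10, 12]   # X1
national_income = [2.40, 2.72, 2.08, 3.68, 2.56, 3.36, 2.24, 3.20,
                   3.84, 2.72, 2.07, 2.33, 2.98, 1.94, 2.17]                # X2

SE = 0.82167                  # standard error of the estimate, Display 14-2
B1, B2 = 0.83966, 1.44097     # coefficients for ADV and NI, Display 14-1


def sum_squared_deviations(values):
    """Sum of squared deviations of the values from their mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def correlation(xs, ys):
    """Simple correlation coefficient between two series."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cross = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return cross / math.sqrt(sum_squared_deviations(xs) * sum_squared_deviations(ys))


def coefficient_std_error(se, own_values, r):
    """Formula (14.16); valid only when there are exactly two independent variables."""
    return se / math.sqrt(sum_squared_deviations(own_values) * (1 - r ** 2))


r12 = correlation(advertising, national_income)
sb1 = coefficient_std_error(SE, advertising, r12)
sb2 = coefficient_std_error(SE, national_income, r12)   # assumed symmetric use of (14.16)
t1 = B1 / sb1
t2 = B2 / sb2

print(f"S_b1 = {sb1:.5f}, t = {t1:.3f}")
print(f"S_b2 = {sb2:.5f}, t = {t2:.3f}")
```

Both standard errors share the factor (1 − r12²) in the denominator, which is why a high correlation between the independent variables inflates them together.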


This same test for significance can also be performed for $b_2$, the coefficient for
NI. According to the printout, NI has a p-value (SIG T) of 7.39 percent. Thus, NI
should prove significant at any α-value above 7.39 percent. If we were to test at the
5 percent level, the critical t, as noted above, is ±2.179.


Decision Rule  Do not reject if −2.179 < t < 2.179. Reject otherwise.


The t-value reported for NI on the printout is seen to be 1.958, which is in the do not
reject region. Thus, the hypothesis that $\beta_2 = 0$ is not rejected, and it is concluded that
at the 5 percent level of significance NI has no explanatory power.

However, if the test is performed at the 10 percent level, a different conclusion is
reached, as seen in Figure 14-5. From Table F in Appendix III we find $t_{0.10,12} = \pm 1.782$.


Figure 14-5. A 10 Percent Test of Significance of National Income (critical values ±1.782 at α = 0.10 and ±2.179 at α = 0.05; test statistic 1.96)


Decision Rule  Do not reject if −1.782 < t < 1.782. Reject otherwise.


Since t = 1.958, Ace must reject the null that $\beta_2 = 0$ at the 10 percent level of
significance and conclude with 90 percent confidence that NI does have some
explanatory linear relationship with passengers.

Our tests show that NI proves significant at the 10 percent level, but at the 5 percent
level of significance the hypothesis that $\beta_2 = 0$ cannot be rejected. These results
correspond with the p-value for NI, which states that the hypothesis $\beta_2 = 0$ can be
rejected at any level of significance above 7.39 percent.

This demonstrates the need for the researcher to choose the value for α prior to the
test. Since different values for α can result in different conclusions, ethical and
impartial research requires that the α-value be determined on the basis of the
consequences of a Type I error relative to those of a Type II error as discussed in
Chapter 9.

In summary, Ace can report to Hop Scotch’s management that, at the 5 percent 
level, advertising proves to be a significant explanatory variable for passengers, while 
national income does not appear significant. Of course, given the F-test performed 
earlier at the 5 percent level, Ace expected at least one significant variable. At the
10 percent level, both advertising and national income prove significant. 
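The way the conclusion shifts with α can be laid out in a small Python sketch. The critical values are the two-tailed figures for 12 degrees of freedom quoted in this chapter (1.782, 2.179, and 3.055), and the t-statistics come from the printout; nothing here is computed from a t-distribution. The names `CRITICAL_T` and `is_significant` are illustrative.

```python
# Two-tailed critical t-values for 12 degrees of freedom, as quoted in the text.
CRITICAL_T = {0.10: 1.782, 0.05: 2.179, 0.01: 3.055}

# t-statistics reported on the SPSS-PC printout.
T_STATISTICS = {"ADV": 5.917, "NI": 1.958}


def is_significant(t_value, alpha):
    """Reject H0: beta_i = 0 when |t| exceeds the critical value for this alpha."""
    return abs(t_value) > CRITICAL_T[alpha]


decisions = {
    name: {alpha: is_significant(t_value, alpha) for alpha in CRITICAL_T}
    for name, t_value in T_STATISTICS.items()
}

for name, by_alpha in decisions.items():
    for alpha, significant in by_alpha.items():
        verdict = "reject H0" if significant else "do not reject H0"
        print(f"{name} at alpha = {alpha:.2f}: {verdict}")
```

The table of decisions makes the point of the preceding paragraphs concrete: the data do not change, only the threshold does, so α has to be fixed before looking at the results.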
If Ace performed a 1 percent test of significance, would he find
a. The model as a whole is significant?
b. Advertising is significant?

Answer:
a. $F_{0.01,2,12} = 6.93$. Since F = 121.18 > 6.93, reject null.
b. At 1 percent the critical t = ±3.055. Since t = 5.917, reject null. Advertising is
   significant at 1 percent.

MULTIPLE CORRELATION
Still another tool Ace can use to evaluate his model is the coefficient of multiple
determination. For the sake of convenience, the term multiple is often assumed given
the context of the discussion, and the expression is shortened to coefficient of
determination, the same as for the simple model. Another similarity between the
simple model and a model containing two or more explanatory variables is the
interpretation of the coefficient of determination. In both instances it measures the
portion of the change in Y explained by all the independent variables in the model.


To measure that portion of the total change in Y explained by the regression
model, we use the ratio of explained variation to total variation just as we did in the
case of simple regression. As we noted in Chapter 13, by variation we mean the
variation in the observed Y-values ($Y_i$) from their mean ($\bar{Y}$). The variation in Y that is
explained by our model is reflected by the regression sum of squares (SSR). The total
variation in Y is, in turn, measured by the total sum of squares (SST). Thus,

$$R^2 = \frac{SSR}{SST} \qquad [14.18]$$

Since SST = SSR + SSE, we also have

$$R^2 = 1 - \frac{SSE}{SST} \qquad [14.19]$$

Notice that the coefficient is $r^2$ in the simple model and $R^2$ in our present discussion.
From the sums of squares in Table 14-4 (confirmed by the R Square entry in Display 14-2,
shown earlier), we see that

$$R^2 = \frac{SSR}{SST} = \frac{163.632}{171.733} = 0.953$$
Thus, 95.3 percent of the change in the number of passengers Hop Scotch transports is
explained by changes in advertising and national income. This compares favorably to
$r^2 = 0.93$ for the simple model in Chapter 13 containing only advertising. By
incorporating NI as a second independent variable, we have increased the explanatory
power of the model from 93 to 95.3 percent.

To avoid the need to calculate SSR and SST, we can find $R^2$ by

$$R^2 = \frac{n\left[b_0 \sum Y + b_1 \sum X_1 Y + b_2 \sum X_2 Y + \cdots + b_k \sum X_k Y\right] - (\sum Y)^2}{n \sum Y^2 - (\sum Y)^2} \qquad [14.20]$$

$$R^2 = \frac{15\left[(3.53)(268) + (0.84)(3{,}490) + (1.44)(746.62)\right] - (268)^2}{(15)(4{,}960) - (268)^2} = 0.957$$

The slight difference in $R^2$ using Formulas (14.18) and (14.20) is due to rounding.
As with the simple model, we always find $0 \le R^2 \le 1$. Of course, the higher the $R^2$,
the more explanatory power the model has.
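The rounding effect mentioned above can be seen directly in a short Python sketch. It computes R² once from the sums of squares, using the five-decimal coefficients printed by SPSS-PC, and once from the shortcut Formula (14.20) with coefficients rounded to two decimals as in the hand calculation. The function name `r2_shortcut` is an illustrative choice.

```python
passengers = [15, 17, 13, 23, 16, 21, 14, 20, 24, 17, 16, 18, 23, 15, 16]   # Y
advertising = [10, 12, 8, 17, 10, 15, 10, 14, 19, 10, 11, 13, 16, 10, 12]   # X1
national_income = [2.40, 2.72, 2.08, 3.68, 2.56, 3.36, 2.24, 3.20,
                   3.84, 2.72, 2.07, 2.33, 2.98, 1.94, 2.17]                # X2

b0, b1, b2 = 3.52840, 0.83966, 1.44097   # coefficients reported by SPSS-PC
n = len(passengers)
mean_y = sum(passengers) / n

# Formula (14.19): R^2 = 1 - SSE / SST, with SSE built from the residuals.
predicted = [b0 + b1 * x1 + b2 * x2 for x1, x2 in zip(advertising, national_income)]
sse = sum((y - y_hat) ** 2 for y, y_hat in zip(passengers, predicted))
sst = sum((y - mean_y) ** 2 for y in passengers)
r2_from_sums_of_squares = 1 - sse / sst


def r2_shortcut(c0, c1, c2):
    """Formula (14.20): R^2 from column totals and the coefficients."""
    sum_y = sum(passengers)
    sum_yy = sum(y * y for y in passengers)
    sum_x1y = sum(x1 * y for x1, y in zip(advertising, passengers))
    sum_x2y = sum(x2 * y for x2, y in zip(national_income, passengers))
    numerator = n * (c0 * sum_y + c1 * sum_x1y + c2 * sum_x2y) - sum_y ** 2
    return numerator / (n * sum_yy - sum_y ** 2)


# Rounded coefficients, as used in the hand calculation.
r2_rounded = r2_shortcut(3.53, 0.84, 1.44)

print(f"R^2 from sums of squares: {r2_from_sums_of_squares:.4f}")
print(f"R^2 from shortcut with rounded coefficients: {r2_rounded:.4f}")
```

The two results differ in the third decimal place, which matches the gap between the two figures in the text.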


The Adjusted Coefficient of Determination 


Because of its importance, $R^2$ is reported by most computer packages. It is a quick and
easy way to evaluate the regression model and to determine how well the model fits
the data. Outside of the regression coefficients themselves, $R^2$ is perhaps the most
commonly observed and closely watched statistic in regression analysis.

However, it is possible for careless, or unscrupulous, statisticians to artificially
inflate $R^2$. One can achieve an increase in $R^2$ merely by adding another independent
variable to the model. Even if some nonsensical variable with truly no explanatory
power is incorporated into the model, $R^2$ will rise. Ace could “pump up” his $R^2$ by
adding to the model, as an explanatory variable, the tonnage of sea trout caught by
sport fishing off the Florida coast. Now, obviously, fishing has little or nothing to do
with Hop Scotch’s passenger list. Yet there is probably at least a tiny bit of totally
coincidental correlation, either positive or negative, between fishing and air travel.
Even a minute degree of correlation will inflate $R^2$. By adding several of these absurd
“explanatory” variables, Ace could illegitimately increase his $R^2$ until it approached
100 percent. A model of this nature may appear to fit the data quite well, but would
produce wretched results in any attempt to predict or forecast the value for the
dependent variable.

It is therefore a common practice in multiple regression and correlation analysis to
report the adjusted coefficient of determination. Symbolized as $\bar{R}^2$, and read as “R
bar squared,” this statistic adjusts the measure of explanatory power for the number of
degrees of freedom. The degrees of freedom for SSE is n − k − 1. The researcher
loses one degree of freedom for every additional independent variable added to the
model, because each variable requires the calculation of another $b_i$. $\bar{R}^2$ will penalize
the researcher for incorporating a variable that does not add enough explanatory
power to the model to justify the loss of a degree of freedom. The value of $\bar{R}^2$ will go
down. If it decreases too much, consideration must be given to excluding that variable
from the model. In extreme cases, the adjusted coefficient of determination may
actually become less than zero.

This adjusted coefficient is obtained by dividing SSE and SST by their respective
degrees of freedom.


$$\bar{R}^2 = 1 - \frac{SSE/(n - k - 1)}{SST/(n - 1)} \qquad [14.21]$$

A more computationally convenient formula for $\bar{R}^2$ is

$$\bar{R}^2 = 1 - (1 - R^2)\,\frac{n - 1}{n - k - 1} \qquad [14.22]$$

Since the numerator in Formula (14.21) is the MSE, it may be said that $\bar{R}^2$ is a
combination of the two measures of the performance of a regression model: the mean
square error and the coefficient of determination.

By using the data from his model, Ace can determine $\bar{R}^2$ as

$$\bar{R}^2 = 1 - (1 - 0.953)\,\frac{15 - 1}{15 - 2 - 1} = 0.945$$

After adjusting for the degrees of freedom, Ace’s model reports an $\bar{R}^2$ of 94.5 percent.
As you might expect, most computer programs also report the adjusted coefficient
of determination. Display 14-2 for SPSS-PC reveals an Adjusted R Square of
0.94496.
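Both routes to the adjusted coefficient of determination can be checked with a short Python sketch. It takes SSE and the regression sum of squares from the SPSS-PC ANOVA printout and applies the degrees-of-freedom version and the convenience version of the formula. The function name `adjusted_r2` and the extra call with k = 3 (a hypothetical third variable that leaves R² unchanged) are illustrative additions.

```python
# Sums of squares from the SPSS-PC ANOVA printout (Display 14-3).
SSR = 163.63171
SSE = 8.10162
SST = SSR + SSE
n, k = 15, 2

r2 = 1 - SSE / SST

# Degrees-of-freedom version: divide SSE and SST by their degrees of freedom.
adj_from_sums = 1 - (SSE / (n - k - 1)) / (SST / (n - 1))


def adjusted_r2(r_squared, n_obs, n_vars):
    """Convenience version: 1 - (1 - R^2)(n - 1)/(n - k - 1)."""
    return 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - n_vars - 1)


adj_from_r2 = adjusted_r2(r2, n, k)

# Hypothetical: a third variable that adds nothing to R^2 still costs a degree of freedom.
adj_with_useless_variable = adjusted_r2(r2, n, k + 1)

print(f"R^2 = {r2:.5f}")
print(f"adjusted R^2 = {adj_from_sums:.5f} (sums of squares), {adj_from_r2:.5f} (from R^2)")
print(f"adjusted R^2 with a useless third variable = {adj_with_useless_variable:.5f}")
```

The last line shows the penalty at work: with R² held fixed, moving from two to three independent variables lowers the adjusted figure.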


THE PRESENCE OF MULTICOLLINEARITY 


Earlier we noted the danger of multicollinearity. This problem arises when one of the
independent variables is linearly related to one or more of the other independent
variables. Such a situation violates one of the conditions for multiple regression.
Specifically, multicollinearity occurs if there is a high correlation between two
independent variables, $X_i$ and $X_j$. In Chapter 13 we discussed the correlation
coefficient r for the dependent variable and the single independent variable. If this same
concept is applied to two independent variables, $X_i$ and $X_j$, in multiple regression, we
can calculate the correlation coefficient $r_{ij}$. If $r_{ij}$ is high, multicollinearity exists.

What is high? Unfortunately, there is no answer to this critical question. There is
no magic cutoff point at which the correlation is judged to be too high and
multicollinearity exists. Multicollinearity is a problem of degree. Any time two or more
independent variables are linearly related, some degree of multicollinearity exists. If
its presence becomes too pronounced, the model is adversely affected. What is
considered too high is largely a judgment call by the researcher. Some insight
necessary to make that call is provided in this section.
Assume you are using regression techniques to estimate a demand curve (or
demand function) for your product. Recognizing that the number of consumers is
related to demand, you choose as explanatory variables


$X_1$ = All men in the market area
$X_2$ = All women in the market area
$X_3$ = Total population in the market area


Obviously, $X_3$ is a linear combination of $X_1$ and $X_2$ ($X_3 = X_1 + X_2$). The correlation
$r_{13}$ between $X_1$ and $X_3$ and the correlation $r_{23}$ between $X_2$ and $X_3$ are quite high. This
ensures the presence of multicollinearity and creates many problems in the use of
regression techniques. A discussion of some of the common problems follows.
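The men, women, and total population example can be pushed to its extreme in a small Python sketch. With made-up counts (the numbers below are hypothetical, chosen only for illustration), the matrix of sums behind the normal equations has a determinant of exactly zero once the total is included alongside its two parts, so the equations no longer have a unique solution. The helper names `determinant` and `normal_equation_matrix` are illustrative.

```python
def determinant(matrix):
    """Determinant by cofactor expansion along the first row (fine for tiny matrices)."""
    if len(matrix) == 1:
        return matrix[0][0]
    total = 0
    for j, value in enumerate(matrix[0]):
        minor = [row[:j] + row[j + 1:] for row in matrix[1:]]
        total += (-1) ** j * value * determinant(minor)
    return total


def normal_equation_matrix(columns):
    """Matrix of sums and cross-products for an intercept plus the given columns."""
    design = [[1] * len(columns[0])] + [list(column) for column in columns]
    return [[sum(a * b for a, b in zip(u, v)) for v in design] for u in design]


# Hypothetical market-area counts (in thousands), for illustration only.
men = [120, 135, 150, 110, 160]
women = [130, 140, 145, 125, 170]
total = [m + w for m, w in zip(men, women)]   # X3 = X1 + X2 exactly

det_with_total = determinant(normal_equation_matrix([men, women, total]))
det_without_total = determinant(normal_equation_matrix([men, women]))

print("determinant with X1, X2, X3:", det_with_total)
print("determinant with X1, X2 only:", det_without_total)
```

Integer data keep the arithmetic exact, so the zero is not a rounding accident. In practice the relationship among independent variables is rarely this exact, and the trouble shows up as the unstable coefficients described next rather than as an outright failure.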


The Problems of Multicollinearity 


One of the more vexing problems of multicollinearity arises from our inability to
separate the individual effects of each independent variable on Y. In the presence of
multicollinearity, it is impossible to disentangle the effects of each $X_i$. Suppose in the
model

$$\hat{Y} = 40 + 10X_1 + 8X_2$$

$X_1$ and $X_2$ showed a high degree of correlation. In this case, the coefficient of 10 for $X_1$
may not represent the true effect of $X_1$ on Y. The regression coefficients become
unreliable and cannot be taken as estimates of the change in Y given a one-unit change
in the independent variable.

Furthermore, the standard errors of the coefficients, $S_{b_i}$, become inflated. If two or
more samples of the same size are taken, a large variation in the coefficients would be
found. In the model specified above, instead of 10 as the coefficient of $X_1$, a second
sample might yield a coefficient of 15 or 20. If $b_1$ varies that much from one sample to
the next, we must question its accuracy.

Multicollinearity can even cause the sign of the coefficient to be opposite that 
which logic would dictate. For example, if you included price as a variable in the 
estimation of your demand curve, you might find it took on a positive sign. This 
implies that as the price of a good goes up, consumers buy more of it. This is an 
obvious violation of the logic behind demand theory. 


Detecting Multicollinearity


Perhaps the most direct way of testing for multicollinearity is to produce a correlation
matrix for all variables in the model, as shown in Display 14-4. The value of
$r_{12} = 0.8698$ for the correlation between the two independent variables indicates that
NI and ADV are closely related. Although there is no predetermined value for $r_{ij}$
which signals the onset of multicollinearity, a value of 0.8698 is probably high enough
to indicate a significant problem.

Some of the guesswork can be eliminated by using a t-test to determine if the level
of correlation between $X_1$ and $X_2$ differs significantly from zero. Given the nonzero
relationship between $X_1$ and $X_2$ ($r_{12} = 0.8698$) in our sample, we wish to test the
hypothesis that the correlation between $X_1$ and $X_2$ is zero at the population level. We
will test the hypothesis that

$$H_0\colon \rho_{12} = 0$$
$$H_A\colon \rho_{12} \neq 0$$

where $\rho_{12}$ is the population correlation coefficient between $X_1$ and $X_2$, using the
techniques in Chapter 13. There we demonstrated that

$$t = \frac{r_{12}}{S_r}$$

where

$$S_r = \sqrt{\frac{1 - r_{12}^2}{n - 2}}$$

As an illustration, for the hypothesis that $\rho_{12} = 0$,

$$S_r = \sqrt{\frac{1 - (0.8698)^2}{15 - 2}} = 0.1367$$

$$t = \frac{0.8698}{0.1367} = 6.36$$
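The two steps of this test are easy to verify with a few lines of Python. The sketch plugs the sample correlation of 0.8698 and n = 15 into the expressions for the standard error of r and the t-statistic, then compares the result with 2.16, the two-tailed 5 percent critical value for 13 degrees of freedom that the text uses. The critical value is hard-coded from the text, and the variable names are illustrative.

```python
import math

r12 = 0.8698     # sample correlation between advertising and national income
n = 15           # number of observations

# Standard error of the correlation coefficient, with n - 2 degrees of freedom.
s_r = math.sqrt((1 - r12 ** 2) / (n - 2))
t_stat = r12 / s_r

T_CRITICAL = 2.16    # two-tailed 5 percent critical t for 13 d.f., as quoted in the text
reject_null = abs(t_stat) > T_CRITICAL

print(f"S_r = {s_r:.4f}, t = {t_stat:.2f}, reject H0: {reject_null}")
```

A rejection here says only that the population correlation is unlikely to be zero; it does not by itself say how damaging the multicollinearity is.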


If α is set at 5 percent, the critical $t_{0.05,13} = \pm 2.16$. There are n − 2 degrees of freedom.


Decision Rule  Do not reject if −2.16 < t < 2.16. Reject if t < −2.16 or
t > 2.16.


Since t = 6.36 > 2.16, Ace can reject the null that there is no correlation between $X_1$
and $X_2$ ($\rho_{12} = 0$). Some multicollinearity does exist. This does not mean that the
model is irrevocably defective. In fact, very few models would be totally free of
multicollinearity. How to handle this problem is discussed shortly.

Another way to detect multicollinearity is to compare the coefficients of
determination between the dependent variable and each of the independent variables. We
found the coefficient of determination between passengers and advertising to be
$r^2 = 0.937$, while that between passengers and national income is $r^2 = 0.815$. Yet
together the two independent variables revealed $R^2$ of only 0.957. If taken separately,
the two independent variables explain 93.7 and 81.5 percent of the change in Y. But in
combination they explain only 95.7 percent. Apparently there is some overlap in their
explanatory power. Including the second variable of NI did little to raise the model's
ability to explain the level of the passengers. Much of the information about passengers
already provided by advertising is merely duplicated by NI. This is an indication that
multicollinearity might be present.

A third way to detect multicollinearity is to use the variance inflation factor
(VIF). The VIF associated with any X-variable is found by regressing it on all the
other X-variables. The resulting $R^2$ is then used to calculate that variable’s VIF. The
VIF for any $X_i$ represents that variable's influence on multicollinearity.


Variance Inflation Factor  The VIF for any independent variable is a measure of the
degree of multicollinearity contributed by that variable.


Since there are only two independent variables in Hop Scotch's model, regressing
$X_1$ on all other independent variables ($X_2$) or regressing $X_2$ on all other independent
variables ($X_1$) yields the same correlation coefficient ($r_{12} = 0.8698$), as shown in
Table 14-6. The VIF for any given independent variable $X_i$ is

$$\text{VIF}(X_i) = \frac{1}{1 - R_i^2} \qquad [14.23]$$


where $R_i^2$ is the coefficient of determination obtained by regressing $X_i$ on all other
independent variables. As noted, multicollinearity produces an increase in the variation,
or standard error, of the regression coefficient. VIF measures the increase in the
variance of the regression coefficient over that which would occur if multicollinearity
were not present.


The VIF for advertising in Ace's model is

$$\text{VIF}(X_1) = \frac{1}{1 - (0.8698)^2} = 4.1$$


The same VIF for $X_2$ would be found since there are only two independent variables.
If an independent variable is totally unrelated to any other independent variable, its
VIF equals 1. The variance in $b_1$ and $b_2$ is therefore more than four times what it
would be without multicollinearity in the model. However, in general, multicollinearity
is not considered a significant problem unless the VIF of a single $X_i$
measures at least 10, or the sum of the VIFs for all $X_i$ is at least 10.
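The VIF calculation and the rule of thumb above can be sketched in a few lines of Python. This is an illustrative sketch rather than anything taken from the text: the function name `vif` and the variable names are assumptions. It uses the fact, stated above, that with only two independent variables the auxiliary $R^2$ is simply $r_{12}^2$.

```python
def vif(r_squared):
    """Variance inflation factor for a variable whose auxiliary regression
    (that variable regressed on all other X-variables) has the given R^2."""
    if not 0 <= r_squared < 1:
        raise ValueError("R^2 must lie in the interval [0, 1)")
    return 1.0 / (1.0 - r_squared)


r12 = 0.8698                      # correlation between advertising and NI
vif_adv = vif(r12 ** 2)           # two regressors: auxiliary R^2 equals r12 squared
tolerance = 1.0 / vif_adv         # tolerance is the reciprocal of the VIF

# Rule of thumb from the text: a single VIF of at least 10 signals a serious problem.
serious = vif_adv >= 10

print(f"VIF = {vif_adv:.1f}, tolerance = {tolerance:.4f}, serious: {serious}")
```

A variable that is unrelated to every other independent variable has an auxiliary $R^2$ of zero, so `vif(0.0)` returns 1.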

Other indications of multicollinearity include large changes in coefficients or their 
sign when there is a small change in the number of observations. Furthermore, if the 
F-ratio is significant and the t-values are not, multicollinearity may be present. If the 
addition or deletion of a variable produces large changes in coefficients or their signs, 
multicollinearity may exist. 

In summary, in the presence of multicollinearity we find 


1. An inability to separate the net effect of individual independent variables upon Y. 
2. An exaggerated standard error for the b-coefficients. 

3. Algebraic signs of the coefficients that violate logic. 

4. A high correlation between independent variables, and a high VIF.


5. Large changes in coefficients or their signs if the number of observations is 
changed by a single observation. 


6. A significant F-ratio combined with insignificant t-ratios. 
7. Large changes in coefficients or their signs when a variable is added or deleted. 
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Treating Multicollinearity 


What can be done to eliminate or mitigate the influence of multicollinearity? Perhaps
the most logical solution is to drop the offending variable. If $X_i$ and $X_j$ are closely
related, one of them can simply be excluded from the model. After all, due to overlap,
the inclusion of the second variable adds little to the further explanation of Y.

In reference to Hop Scotch's model, it might be advisable to drop NI since its
correlation with Y is less than that of advertising. The t-tests performed earlier also
suggested that NI was not significant at the 5 percent level.

However, simply dropping one of the variables can lead to specification bias, in
which the form of the model is in disagreement with its theoretical foundation.
Multicollinearity might be avoided, for example, if income were eliminated from a
functional expression for consumer demand. However, economic theory, as well as
plain common sense, tells us that income should be included in any attempt to explain
consumer demand.


Specification Bias  A misspecification of a model due ... theoretical ...


... effective method of eliminating multicollinearity. This, too, could be applied to NI.

It is also possible to combine two or more variables. This could be done with the
model for consumer demand, which employed $X_1$ = men, $X_2$ = women, and $X_3$ =
total population. Variables $X_1$ and $X_2$ could be added to form $X_3$. The model would
then consist of only one explanatory variable.

In any event, we should recognize that some degree of multicollinearity exists in
most regression models containing two or more independent variables. The greater the
number of independent variables, the greater the likelihood of multicollinearity.
However, this will not necessarily detract from the model's usefulness because the
problem of multicollinearity may not be severe. Multicollinearity will cause
errors in individual coefficients, yet the combined effect of these coefficients is not
drastically altered. A predictive model designed to predict the value of Y on the basis
of all $X_i$ taken in combination will still possess considerable accuracy. Only explanatory
models, created to explain the contribution to the value of Y by each $X_i$, tend to
collapse in the face of multicollinearity.


COMPARING REGRESSION COEFFICIENTS


After developing the complete model, there is often a tendency to compare regression
coefficients to determine which variable exerts more influence on Y. This dangerous
temptation must be avoided. For the model

$$\hat{Y} = 40 + 10X_1 + 200X_2$$


where Y is tons of output, $X_1$ is units of labor input, and $X_2$ is units of capital input, one
might conclude that capital is more important than labor in determining output, since it
has the larger coefficient. After all, a 1-unit increase in capital, holding labor constant,
results in a 200-unit increase in output. However, such a comparison is not possible.
All variables are measured in totally dissimilar units: one in units of weight, another in
number of people, and a third in machines.

Measuring all the variables in the same manner still does not allow us to judge the
relative impact of independent variables based on the size of their coefficients.
Suppose a model is stated in terms of monetary units, such as

$$\hat{Y} = 50 + 10{,}000X_1 + 20X_2$$

where Y is in dollars, $X_1$ is in units of $1,000, and $X_2$ is in cents. Despite the large
coefficient for $X_1$, it is not possible to conclude that it is of greater impact. A $1,000
(1 unit) increase in $X_1$ increases Y by 10,000 units. A $1,000 (100,000 units) increase
in $X_2$ will increase Y by 2,000,000 units (100,000 x 20).
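The comparison above is simple arithmetic, and writing it out makes the role of the measurement units explicit. The sketch below is illustrative only; the helper name `change_in_y` is an assumption. It applies the same $1,000 increase to each regressor after converting it into that regressor's own units.

```python
def change_in_y(coefficient, units_changed):
    """Predicted change in Y when a regressor moves by `units_changed`
    of its own measurement units, holding the other regressor constant."""
    return coefficient * units_changed


DOLLARS = 1_000                    # the same $1,000 increase for both regressors
x1_units = DOLLARS // 1_000        # X1 is measured in units of $1,000 -> 1 unit
x2_units = DOLLARS * 100           # X2 is measured in cents -> 100,000 units

effect_x1 = change_in_y(10_000, x1_units)
effect_x2 = change_in_y(20, x2_units)

print(effect_x1, effect_x2)        # prints: 10000 2000000
```

The regressor with the much smaller coefficient produces the larger change in Y once both are moved by the same dollar amount, which is why raw coefficient size says little about relative impact.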

Even if we express Y, $X_1$, and $X_2$ in units of $1, we cannot compare the relative
impact of $X_1$ and $X_2$ on changes in Y. Factors other than a variable's coefficient
determine its total impact on Y. For example, the variance in a variable is quite
important in determining its influence on Y. The variance measures how often and
how much a variable changes. Thus, a variable may have a large coefficient, and every
time it changes it affects Y noticeably. But if its variance is very small and it changes
only once in a millennium, its overall impact on Y will be negligible.

To offset these shortcomings, we sometimes measure the response of Y using
standardized regression coefficients. Standardized regression coefficients, also
called beta coefficients (not to be confused with the beta value $\beta$, which is the unknown
coefficient at the population level), reflect the change in the mean response of
Y, measured in the number of standard deviations of Y, to changes in $X_i$, measured in
the number of standard deviations of $X_i$. The intended effect of calculating beta values
is to make the coefficients "dimensionless."

These beta values are reported by most computer programs, and are seen in
Display 14-5. A one-standard-deviation change in NI results in a 0.24880 standard
deviation change in Y. However, these beta coefficients suffer many of the same deficiencies
as the normal coefficients. Hence, it is generally considered poor practice to judge the
importance of a variable on the basis of its beta coefficient.
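A beta coefficient is the ordinary coefficient rescaled by the ratio of the sample standard deviations, $b_i s_{X_i} / s_Y$. The sketch below illustrates this with a small made-up data set (the values in `xs` and `ys` are hypothetical and are not the Hop Scotch data); the function names are assumptions. It uses a simple regression, where the beta coefficient coincides with the correlation coefficient; in a multiple regression the same rescaling is applied to each partial coefficient.

```python
import statistics


def ols_slope(xs, ys):
    """Least-squares slope of y on a single x."""
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


def beta_coefficient(b, xs, ys):
    """Standardized (beta) coefficient: b rescaled by s_x / s_y."""
    return b * statistics.stdev(xs) / statistics.stdev(ys)


# Hypothetical illustration data, not taken from the chapter.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

b = ols_slope(xs, ys)
beta = beta_coefficient(b, xs, ys)

print(f"b = {b:.3f}, beta = {beta:.4f}")
```

Rescaling `xs` (for example, multiplying every value by 100) changes `b` but leaves `beta` unchanged, which is the sense in which beta values are "dimensionless."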


It is entirely possible that Ace may decide to include a third explanatory variable in his
model. If he felt that, for example, the level of competition Hop Scotch encountered
from other airlines might affect sales, Ace should consider including some measurement
of the competitive forces in the market in his model to predict and explain sales
revenues. The contribution such a measurement might provide to the model would
then be examined in terms of its impact on the adjusted coefficient of determination,
the standard error of the estimate, and other statistical measures presented throughout
our discussion of regression and correlation analysis.

Actually, Ace may test any number of variables he feels might add to the
explanatory power of his model. It is not uncommon for a model of this nature to
include several explanatory variables; in fact, the failure to include any useful
explanatory variable detracts from the model's total predictive ability.

However, it is important to include only those variables that offer a definite
contribution to the model's explanatory power. Including variables of little value can
reduce the adjusted coefficient of determination and contribute to the problem of
multicollinearity. Any model should therefore use as few variables as possible to
achieve its level of explanatory power. Such a model is said to be parsimonious.


Parsimony  The use of as few variables as possible in the formation of a model.


STEPWISE REGRESSION


Many modern computer packages offer a procedure which allows the statistician the
option of permitting the computer to select the desired independent variables from a
prescribed list of possibilities. The statistician provides the data for several potential
explanatory variables and then, with certain commands, instructs the computer to
determine which of those variables are best suited to formulate the complete model.

In this manner, the regression model is developed in stages; this is known as
stepwise regression. It can take the form of (1) backward elimination or (2) forward
selection. Let's take a look at each.


Backward Elimination 


To execute backward elimination, we calculate the entire model, using all independent
variables. The t-values are then computed for all coefficients. If any prove to be
insignificant, the one with a t-value closest to zero is eliminated and the model is
calculated again. This continues until all remaining $b_i$ are significantly different from
zero.


Forward Selection


As the name implies, forward selection is the opposite of backward elimination. First,
the variable most highly correlated with Y is selected for inclusion in the model. The
second step is the selection of a second variable based on its ability to explain Y given
that the first variable is already in the model. The selection of the second variable is
based on its partial coefficient of determination, which is a variable's marginal
contribution to the explanatory power of the model, given the presence of the first
variable.

Assume, for example, that the first variable selected is $X_i$. Every possible
two-variable model is computed in which one of those variables is $X_i$. That model which
produces the highest $R^2$ is chosen. This process continues until all X-variables are in
the model or until the addition of another variable does not result in a significant
increase in $R^2$.
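The forward selection procedure just described can be sketched in plain Python. This is a simplified illustration, not the algorithm used by any particular statistical package: every function name is an assumption, the data are hypothetical, and the stopping rule is a fixed minimum gain in $R^2$ (`min_gain`) standing in for the formal significance test on the increase in $R^2$ that the text calls for.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]      # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):                          # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def r_squared(columns, y):
    """R^2 from an OLS fit of y on the given predictor columns plus an intercept."""
    n = len(y)
    design = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(design[0])
    xtx = [[sum(row[p] * row[q] for row in design) for q in range(k)] for p in range(k)]
    xty = [sum(design[i][p] * y[i] for i in range(n)) for p in range(k)]
    coef = solve(xtx, xty)                                  # normal equations
    fitted = [sum(c * v for c, v in zip(coef, row)) for row in design]
    mean_y = sum(y) / n
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    sst = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - sse / sst


def forward_selection(candidates, y, min_gain=0.01):
    """Add, one at a time, the candidate that raises R^2 the most; stop when the
    best available gain falls below min_gain (a simplification of the F-test)."""
    chosen, best_r2 = [], 0.0
    remaining = dict(candidates)
    while remaining:
        scores = {name: r_squared([candidates[c] for c in chosen] + [col], y)
                  for name, col in remaining.items()}
        name = max(scores, key=scores.get)
        if scores[name] - best_r2 < min_gain:
            break
        chosen.append(name)
        best_r2 = scores[name]
        del remaining[name]
    return chosen, best_r2


# Hypothetical data: y depends almost entirely on x1; x2 adds very little.
candidates = {"x1": [1, 2, 3, 4, 5, 6], "x2": [2, 1, 2, 1, 2, 1]}
y = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9]

selected, final_r2 = forward_selection(candidates, y)
print(selected, round(final_r2, 4))
```

With these data only `x1` enters the model, because adding `x2` raises $R^2$ by far less than the threshold. A model produced this way still has to be checked against theory, for the reasons given in the surrounding text.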

Although stepwise regression appears to be a convenient and effective method of 
model specification, certain precautions must be taken. The process will “mine” the 
data, prospecting for a statistically accurate model with the highest $R^2$. However, a
computer cannot think or reason, and the resulting model may be statistically sound 
but contrary to all logical and theoretical principles, and thereby suffer from specifica- 
tion bias. Stepwise regression should therefore be used with extreme caution, and any 
model formulated in this manner should be closely scrutinized. 


COMPUTER APPLICATIONS 


As noted, multiple regression and correlation techniques are, for the most part, accomplished 
with the aid of a computer package. After working through the regression model for Hop 
Scotch, it should be obvious that with many variables and/or observations, the computations 
necessary for multiple regression and correlations can be overwhelming. This section illustrates 
how three popular programs handle the required techniques. 


Minitab 

After the data have been loaded into a Minitab worksheet, it is a simple matter to complete a 
multiple regression model. Presume Ace puts the data for passengers in Column 1 of the 
worksheet, and in Columns 2 and 3, the values for advertising and national income, respec- 
tively. The Minitab command to regress passengers on the two independent variables is 


MTB>REGRESS C1 on 2 C2 C3


The resulting printout is shown in Minitab 1. The regression equation is displayed at the top,
followed by a more complete description of each variable. The coefficients (coef) are specified,
along with their standard deviations and the t-values we calculated earlier. The p-values are
also given. Notice that the variable NI is not significant at the 5 percent level, but is at the 10
percent level. This is exactly what we found in our previous calculations. The standard error of
0.8217 is also there. Values for $R^2$ and the adjusted $R^2$ follow. The ANOVA table completes
the Minitab printout. All the information we so laboriously calculated by hand in the chapter is
quickly obtained with Minitab.


Minitab 1


The regression equation is
C1 = 3.53 + 0.840 C2 + 1.44 C3


Predictor      Coef      Stdev    t-ratio        p     VIF
Constant     3.5284     0.9994       3.53    0.004
C2           0.8397     0.1419       5.92    0.000     4.1
C3           1.4410     0.7360       1.96    0.074     4.1

s = 0.8217     R-sq = 95.3%     R-sq(adj) = 94.5%

              Degrees of     Sum of       Mean
Source        Freedom        Squares      Square         F        p
Regression     2             163.632      81.816    121.18    0.000
Error         12               8.102       0.675
Total         14             171.733

SOURCE        DF      SEQ SS
C2             1     161.064
C3             1       2.588
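Most of the summary figures on this printout follow directly from the sums of squares and the coefficient table. The sketch below recomputes them as a check; it is illustrative only, the function name `anova_summary` is an assumption, and the input numbers are copied from the Minitab printout (n = 15 observations, k = 2 independent variables).

```python
import math


def anova_summary(ssr, sse, n, k):
    """Recompute F, the standard error of the estimate, R^2, and adjusted R^2
    from the regression and error sums of squares."""
    sst = ssr + sse
    msr = ssr / k                      # mean square regression
    mse = sse / (n - k - 1)            # mean square error
    r2 = ssr / sst
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)
    return {"F": msr / mse, "s_e": math.sqrt(mse), "R2": r2, "adj_R2": adj_r2}


stats = anova_summary(ssr=163.632, sse=8.102, n=15, k=2)

# t-ratios are the coefficients divided by their standard deviations.
t_adv = 0.8397 / 0.1419
t_ni = 1.4410 / 0.7360

for label, value in stats.items():
    print(f"{label}: {value:.4f}")
print(f"t(ADV) = {t_adv:.2f}, t(NI) = {t_ni:.2f}")
```

Small differences in the last decimal place relative to the printout are expected, since the printout itself reports rounded sums of squares.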


Minitab also offers a subcommand, entered at the SUBC> prompt, to test for multicollinearity.
Determine the VIF as follows:


MTB>REGRESS C1 on 2 C2 C3;
SUBC>VIF.


You can produce the correlation matrix by the command

MTB>CORR C1-C3


Minitab 2 is the resulting matrix. It shows the simple correlation between each pair of
variables.


Minitab 2

          C1       C2
C2     0.968
C3     0.903    0.870


As with simple regression, Minitab provides 95 percent interval estimates with a subcommand
for specified values for the independent variables. Thus, for example,


MTB>REGRESS C1 on 2 C2 C3;
SUBC>PREDICT 10 2.7.


generates the 95 percent interval for the conditional mean and the predictive interval when
ADV is 10 and NI is 2.7.


SAS


The SAS program necessary to run the multiple regression for Hop Scotch is

DATA;
INPUT PASS ADV NI;
CARDS;

data lines go here

PROC REG;
MODEL PASS = ADV NI;


The subsequent printout is shown in SAS 1. The printout contains all the statistics we
calculated in the chapter, and is self-explanatory. For a complete discussion, see the section on
Minitab. The variance inflation factors are obtained by altering the last line in the SAS program
to read


MODEL PASS = ADV NI / VIF;


The correlation matrix is obtained by adding to the above program the commands


PROC CORR;
VAR PASS ADV NI;


SAS 1

DEP VARIABLE: PASS

                       SUM OF        MEAN
SOURCE       DF       SQUARES      SQUARE     F VALUE    PROB > F
MODEL         2      163.6317     81.8158     121.184      0.0000
ERROR        12        8.1016      0.6751
C TOTAL      14      171.7333

               PARAMETER    STANDARD    T FOR H0:
VARIABLE        ESTIMATE       ERROR    PARAMETER = 0    PROB > |T|
INTERCEP           3.528      0.9994            3.530        0.0041
NI                 1.441      0.7360            1.958        0.0739
ADV                0.839      0.1419            5.917        0.0001
SPSS-PC


The program to execute Hop Scotch's multiple regression model is


DATA List Free/Pass Adv NI. 
Begin Data. 


data go here 


End data. 
Regression Variables = Pass Adv NI/Dependent = Pass/Method Enter. 


The printout looks much like those specified above. The command 


REGRESSION VARIABLES PASS ADV NI/DEPENDENT PASS/METHOD ENTER 
/CASEWISE ALL. 


provides the residuals for all CASES (observations). You will also find this command useful: 


REGRESSION VARIABLES PASS ADV NI/STATISTICS CI TOL 
/METHOD ENTER. 


This generates the 95 percent confidence intervals, while the inclusion of TOL produces the
tolerance value. The reciprocal of the tolerance value is the VIF.

CBS easily provides results for multiple regression models. Select Number 10, Multiple
Regression Analysis, and then 1 for Enter Data from Keyboard. Then select R for raw data, the
number of variables, their labels, and the number of data points. You can then toggle on options
for residual analysis, correlation coefficients, and forecasting if you desire. The printouts
contain the same information we discussed in the chapter. The display shown here is an
annotated printout of what CBS generates.


CBS 1

Results
Variable        coeff     stderr       Beta    t value    p value
ADV            0.8397     0.1419     0.7520     5.9170     0.0001
NI             1.4410     0.7360     0.2488     1.9577     0.0749
B0 Intercept:  3.5284
Critical t:    2.1790

Results
Sum Squares Regression:              163.6317
Sum Squares Error:                     8.1016
Sum Squares Total:                   171.7333
Mean Square Regression:               81.8159
Mean Square Residual:                  0.6751
C.O.D. (R-Squared):                    0.9528
Adjusted C.O.D. (R-Squared):           0.9450
Multiple Correlation Coefficient:      0.9761
Standard Error Estimate:               0.8217
dof Regression:                        2
dof Error:                            12
Critical F:                            3.8900
Computed F:                          121.1843
F(p value):                            0.0001
Durbin-Watson Statistic:               1.9515


CHAPTER CHECKLIST
After studying this chapter, as a test of your knowledge of the essential points, can you

____ Interpret the regression coefficients for a multiple regression model?
____ Create and interpret the ANOVA table for a multiple regression model?
____ Test hypotheses for the regression coefficients?
____ Interpret the adjusted coefficient of determination?
____ Define multicollinearity?
____ Discuss the causes and problems of multicollinearity, and demonstrate how to detect its
     presence?
____ Calculate and interpret variance inflation factors?


LIST OF SYMBOLS AND TERMS


$S_e$                        Standard error of the estimate

$S_b$                        Standard error of regression coefficient

$R^2$                        Coefficient of (multiple) determination

$\bar{R}^2$                  Adjusted coefficient of determination

Regression Plane             A geometric plane with slopes that represent the values of the
                             two independent variables in a multiple regression.

Multicollinearity            The problem that arises when two or more independent variables
                             are linearly related.

Adjusted Coefficient         The coefficient of determination that has been adjusted for the
of Determination             degrees of freedom.

Variance Inflation Factor    The measure of a variable's contribution to multicollinearity.

Specification Bias           The improper inclusion or exclusion of variables from the
                             model.


LIST OF FORMULAS


$$S_e = \sqrt{\frac{\sum Y^2 - b_0\sum Y - b_1\sum X_1Y - b_2\sum X_2Y - \cdots - b_k\sum X_kY}{n - k - 1}}$$

The standard error of the estimate measures the dispersion of the Y values around their mean.

$$t = \frac{b_i - \beta_i}{S_{b_i}}$$

The t-value tests the significance of the partial regression coefficient.

$$S_{b_1} = \frac{S_e}{\sqrt{\sum (X_1 - \bar{X}_1)^2 \,(1 - r_{12}^2)}}$$

The standard error of the regression coefficient measures the variation in the coefficients from
one sample to the next.

$$\bar{R}^2 = 1 - (1 - R^2)\,\frac{n - 1}{n - k - 1}$$

The adjusted coefficient of determination takes into account the degrees of freedom.

$$\text{VIF}(X_i) = \frac{1}{1 - R_i^2}$$

A variable's variance inflation factor measures that variable's contribution to the problem of
multicollinearity.


EXERCISES


You Make the Decision
1. An economist for the Federal Reserve Board proposed to estimate the Dow Jones
   industrial average using, as explanatory variables, $X_1$: interest rate on AAA corporate
   bonds; $X_2$: interest rates on U.S. Treasury securities. Your advice is requested. How
   would you respond, and what statistical problem will likely be encountered?


Conceptual Questions
2. Given the regression equation, with t-values in parentheses,

   $$\hat{Y} = 100 + 17X_1 + 80X_2$$
                    (0.73)   (6.21)

   How might you improve this model?
3. For the equation

   $$\hat{Y} = 100 - 20X_1 + 40X_2$$

   a. what is the estimated impact of $X_1$ on Y?
   b. what conditions regarding $X_2$ must be observed in answering Part a?
4. A demand function is expressed as

   $$\hat{Q} = 10 + 12P + 8I$$

   where Q is quantity demanded, P is price, and I is consumer income. How would you
   respond to this equation?

5. Why does the presence of multicollinearity increase the probability of a Type II error in
   the hypothesis test of $\beta_1$?


Problems
6. A regression of consumption on income and wealth, with t-values in parentheses, is
   shown here. Are the independent variables significant at the 5 percent level? There were
   100 observations.


Č=52+1131+46W O n 
a23 (087) 


7. A model regressing consumption (C) on income (I) and wealth (W) yielded the following
   results:

   F = 17.42
   $$\hat{C} = 402 + 0.83I + 0.71W$$
              (0.71)  (6.21)  (5.47)

   where t-values are shown in parentheses. There were 25 observations in the data set.
   a. What is the meaning of the intercept term?
   b. Are the coefficients significant at the 5 percent level?
   c. Is the model significant at the 10 percent level?

8. Below is the CBS printout for a regression of crop yield on rainfall, fertilizer, and crop
   acidity.
   a. What is the model?
   b. Do the data suggest that the independent variables are significant at the 5 percent
      level?
   c. What is the lowest level of significance for each variable?
   d. How might you improve the model?
   e. How strong is the relationship?

   Variable     B-coeff     stderr       Beta    t value    p value
   rain          0.2273     0.1588     0.2508     1.4309     0.1802
   fert          1.1524     0.2772     0.7714     4.1571     0.0016
   acid         -0.1113     0.1093     0.0935    -1.0181     0.3304
   B0 Intercept:                      3.2946
   Critical t:                        2.2010
   C.O.D. (R-Squared):                0.9283
   Adjusted C.O.D. (R-Squared):       0.9087
   Multiple Correlation Coefficient:  0.9635
   Standard Error Estimate:           6.4987

9. A marketing major at the local university uses data from 1965 to 1990 and regresses the
   demand for automobiles expressed in number of cars sold on price. She finds

   $$\hat{D} = 925 + 173P \qquad r^2 = 0.793$$
                    (t = 8.17)

   a. What does the model tell her?
   b. Does your answer in Part a seem logical, and how do you explain it?

10. How might the marketing major from the previous problem improve her model?

11. A manufacturer estimates his production function as

    $$\hat{Q} = 50 + 10L + 25K$$

    where Q is output, L is labor in people, and K is capital.
    a. If the manufacturer employs 400 workers and uses 17 units of capital, what should
       output be?
    b. If labor costs $30 and capital costs are $57 per unit, should the manufacturer acquire
       more labor or more capital if he wants to increase output?


12. A field of economics referred to as human capital has often held that a person's income (I)
    could be determined on the basis of his or her (1) education level (E), (2) training (T), and
    (3) general level of health (H). Using 25 employees at a small textile firm in North
    Carolina, a researcher regressed income on the other three variables and got the following
    results.

    $$\hat{I} = 27.2 + 37E + 1T + 3.05H$$
               (...)  (6.21)  (4.32)  (6.79)
    $R^2 = 0.61$     F = 5.97

    I is measured in units of $1,000, E and T are measured in years, and H is measured in
    terms of a scaled index of one's health: the higher the index, the better the level of health.
    a. If one's education increases by two years, what happens to his or her income?
    b. Is the model significant at the 5 percent level? State the hypothesis, the decision rule,
       and the conclusion.
    c. Determine which variable(s) is (are) significant at the 10 percent level. State the
       hypotheses, the decision rule, and the conclusion.
    d. What is the value of the adjusted coefficient of determination?

13. What does it mean if the null hypothesis in a test for a single $\beta_i$ is not rejected?

14. In reference to the previous problem, if $H_0: \beta_i = 0$ is not rejected, according to the
    model, what will happen to Y if $X_i$ changes by one unit? By two units?

15. Consider the following model with n = 30.

    $$\hat{Y} = 50 + 10X_1 + 80X_2$$
    $R^2 = 0.78$     $S_{b_1} = 2.73$     $S_{b_2} = 4.71$

    Which variable(s) is (are) significant at the 5 percent level? State the hypothesis and the
    decision rule, and draw a conclusion.

16. Economists have long held that a community's demand for money is affected by (1) level
    of income and (2) interest rate. As income goes up, people want to hold more money to
    facilitate their increased daily transactions. As the interest rate goes up, people choose to
    hold less money because of the opportunity to invest it at the higher interest rate.

    An economist for the federal government regresses money demand (M) on income
    (I) and interest rates (r), where M is expressed in hundreds of dollars and I in thousands of
    dollars. The model is

    $$\hat{M} = 0.44 + 5.49I + 6.4r$$

    A partial ANOVA table is

                        Sum of      Degrees of
    Source              Squares     Freedom
    Between samples     93.59       2
    Within samples      142         9

    a. According to the theory of the demand for money, are the signs of the coefficients as
       expected? Explain.
    b. Test the entire model at $\alpha = 0.01$.

17. Given the conditions in the previous problem, if the standard error for the coefficient for I
    is 1.37 and that of r is 43.6, determine which variable(s) is (are) significant at the 1
    percent level. State the hypothesis, the decision rule, and the conclusion.

18. Below is a partial printout of a regression model with two independent variables.
    Determine the t-values. Are the variables significant at the 10 percent level?

    Variable     B-coeff     stderr       Beta
    x1            0.5072     0.1987     0.9319
    x2            0.0173     0.1036     0.0611
    B0 Intercept:
    Critical t:   2.3680

19. An economic analyst for IBM wishes to forecast regional sales (S) in hundreds of dollars
    on the basis of the number of sales personnel (P), the number of new business starts in the
    region (B), and some measure of prices. As a proxy for the last variable, she uses changes
    in the CPI. She then collects data for 10 sales regions and derives the following model and
    partial ANOVA table:

    $$\hat{S} = -1.01 + 0.422P + 0.091B - 1.8\,CPI$$

                        Sum of      Degrees of
    Source              Squares     Freedom
    Between samples     391.57      3
    Within samples      31.33       6

    a. Test the significance of the entire model at 1 percent. State the hypothesis, the
       decision rule, and the conclusion.
    b. If the standard errors of the coefficients for P, B, and CPI are 0.298, 0.138, and 2.15,
       respectively, test each coefficient at the 10 percent level. State the hypothesis,
       decision rule, and conclusion in each case.
    c. How can you reconcile the findings from Parts a and b?

20. Given the numerous coaching changes in the National Football League in 1989, a sports
    enthusiast tried to build a model that would predict the number of games a team would
    win (W) based on net rushing yards (a team's total yards rushing minus opponent's yards
    rushing) (RY), net passing yards (PY), and net points (NP). She randomly selected the
    following teams and collected her data.

    Team          W       RY      PY      NP
    Bears        13    1,052     -34      80
    Eagles       10      182    -454      52
    Giants       10      -67     -60      55
    Redskins      7     -188     697     -42
    Chiefs        4     -879     697     -66
    Cowboys       3      134     -55     116
    Rams         10      321     545     103
    49ers        13    1,134     -50     129

    Her computer runs using Minitab produced the following results:

    W = 8.82 + 0.00629 RY - 0.00080 PY - 0.0243 NP

    Predictor         Coef       Stdev    t-ratio        p
    Constant         8.820       1.577       5.59    0.005
    RY            0.006293    0.002704       2.33    0.081
    PY           -0.000803    0.002961      -0.27    0.800
    NP            -0.02433     0.02458      -0.99    0.378

    s = 2.769     R-sq = 69.2%     R-sq(adj) = 46.1%

    Analysis of Variance

                  Degrees of     Sum of       Mean
    SOURCE        Freedom        Squares      Square        F       p
    Regression    3              68.836       22.945     2.99
    Error         4              30.664        7.666
    Total         7              99.500

               W        RY        PY
    RY     0.785
    PY    -0.478    -0.602
    NP     0.459     0.791    -0.599

    Analyze the printouts and comment on any findings you feel are particularly revealing.
21. A marketing specialist wishes to study output at production plants for his firm. He takes
    measurements for output (O), worker-hours of labor (L), and an index for the level of
    technology employed in each plant (T). Output is regressed on L and T. The resulting
    equation is

    $$\hat{O} = -0.475 + 1.778L + 0.30275T; \qquad F = 58.03$$

    with

    $\sum O = 195$          $\sum OL = 678$
    $\sum L = 40$           $\sum TO = 7{,}245$
    $\sum T = 428$          $S_{b_1} = 0.7251$
    $\sum O^2 = 6317$       $S_{b_2} = 0.07374$
    $n = 12$

    a. Test each coefficient at the 5 percent level.
    b. Test each coefficient at the 10 percent level.
    c. Develop a measure of the dispersion of the average amount by which the actual
       observations vary around the regression plane.
    d. What is the coefficient of determination?

The following problems can be more easily done on a computer. 

22. A management director is attempting to develop a system designed to identify which
    personal attributes are essential for managerial advancement. Fifteen employees who
    have recently been promoted are given a series of tests to determine their communication
    skills ($X_1$), ability to relate to others ($X_2$), and decision-making ability ($X_3$). Each
    employee's job rating (Y) is regressed on these three variables. The original raw data are

     Y     X1    X2    X3
    80     50    72    18
    75     51    74    19
    84     42    79    22
    62     42     n    17
    92     59    85    25
    75     45    73    17
    63     48    75    16
    69      3    73    19
    68     40     7    20
    87     55    80    30
    92     48    83    33
    82     45    80    20
    74     45    75    18
    80     61    75    20
    62     59    70    15

    a. Develop the regression model. Evaluate it by determining if it shows a significant
       relationship among the dependent variable and the three independent variables.
    b. What can be said about the significance of each $X_i$?

To what cause might you attribute the insignificance of X, and X, in previous problem? 

Obtain the correlation matrix for these variables and test each pair for multicollinearity.
Set $\alpha$ = 5 percent.

Compare your results in the previous problem to those obtained based on VIF.

Should the management director in the problem above use this model to identify
characteristics that made an employee eligible for advancement?

As a class project, a team of marketing students devises a model that explains rent for
student housing near their university. Rent is in dollars, SQFT is the square footage of the
apartment or house, and DIST is distance in miles from house to campus.

Rent     SQFT     DIST
220       900      3.2
250     1,100      2.2
310     1,250      1.0
420     1,300      0.5
350     1,275      1.5
510     1,500      0.5
400     1,290      1.5
450     1,370      0.5
500     1,400      0.5
550     1,550      0.3
450     1,200      0.5
320     1,275      1.5



734 Chapter Fourteen Multiple Regression and Correlation 


a. Devise the model. Is it significant at the 1 percent level? 
b. Evaluate the significance of both coefficients. 
c. Are the signs appropriate? Explain. 

27. Evaluate the model from the previous problem. Does it appear useful in predicting rent?
Explain.

28. Is there evidence of multicollinearity in the model from the previous problem? Does it
invalidate the model for predicting rent? Why or why not?

29. From the model developed above for student rents, can you conclude distance from
campus is a stronger determinant of rent than is square footage? Why or why not?

30. If two apartments have the same space, but one is 2 miles closer to campus, how will its
rent differ from that of the more distant dwelling?

31. In order to expand their model on students' rents, the marketing majors from the problems
above devise a luxury index in which students rate the amenities of an apartment based on
available comforts, such as swimming pools, tennis courts, maid service, and other
luxuries to which students are traditionally accustomed. For the 12 observations above,
this index measured 22, 23, 35, 40, 32, 55, 36, 41, 51, 50, 48, and 29. Incorporate the
variable in your model to explain rents. Analyze and explain why you got these results. Is
your model better with this additional variable? What problem are you likely encountering,
and what change would you make to correct it?

32. Make the change you suggested in the previous problem and discuss your results. 

33. In the past, many economists have studied the spending patterns of consumers in the 
economy. A famous study by Milton Friedman concludes that consumption is a function 
of permanent income, which is defined as the average level of income the consumer 
expects to receive well into the future. The habit-persistence theory of T. M. Brown 
argues that consumption is shaped by a consumer's most recent peak income—the 
highest income received in the recent past. 

To combine these two theories, an economist collected data on consumption
(CONS), permanent income (PERM), and peak income (PEAK), and performed OLS to
devise a model. Given these data, what did that model look like? (All values are in
thousands of dollars.)

CONS   PERM   PEAK
12     15      7
 2     28     31
15     19     21
17     19     24
19     24     27
 4     17     20
20     25     29
17     21     25
15     19     22
16     20     26


a. Evaluate the model. 

b. Would multicollinearity explain the insignificance of PEAK? How can you tell? 
34. From the previous problem, run two simple models using each form of income. What can
you conclude about each type of income as a predictor of consumption and the problem of
multicollinearity in the multiple regression model?
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35. Consider the multiple regression model in Problem 33. 
a. What does the fact that $b_1$ and $b_2$ are less than 1 tell you?
b. If PERM goes up by $1 and PEAK remains constant, by how much will CONS 
change? 
c. If PEAK goes up by $1 and PERM is held constant, by how much will CONS 
change? 

36. Studies in finance have shown that the price of a share of stock (P) is directly related to 
the issuing company’s level of debt (D) and to the dividend rate (DR), but is inversely 
related to the number of shares outstanding (SO). The data shown here are in dollars for P, 
hundreds of thousands for D, in dollars for DR, and in thousands of shares for SO. 


P D DR SO
52.50 12.00 2.10 100
14.25 3.40 0.69 37
35.21 7.40 1.70 68 
45.21 10.40 1.81 90 
17.54 4.00 0.70 32 
22.00 5.10 0.88 45 
37.10 8.50 1.50 78 
29.12 6.70 1.20 60 
46.32 10.65 1.85 96 
49.30 11.34 2.00 99 


a. Devise the OLS model. Does it support the theoretical relationships expressed above? 
What would happen to P if 

b. D increased by $100,000?

c. DR decreased by $0.50? 

d. SO increased by 500 shares? 
37. Given the data shown, calculate the residuals for each observation. Regress Y on $X_1$, $X_2$,
and the residuals. Before doing so, what should the values for $S_e$, $R^2$, and $\bar{R}^2$ be? Perform
the regression and see if you were right. Why did you get that value for $R^2$ and $S_e$?


Y     X1    X2
12    25    26
23    35    36
98    87    86
65    56    54
36    23     4
65    45    52
78    69    59
21    25    36
54    65    49
25    36    24


38. The members of a national social organization at the local university got the bright idea of
selling T-shirts at the campus concerts by Jimmy Buffet, a singer well known for his laid- 
back, tropical Caribbean theme. The shirts would cost $5.20 each plus $1.75 to imprint a 
picture of Buffet and the fraternity letters. They hoped to sell them for just under $10. 
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39. 


After a few concerts at campuses in Georgia and South Carolina, it became painfully
obvious that the fraternity was having a problem judging the market. They either found
themselves with a surplus after the concert, or supply fell far short of demand. A
statistics professor suggested that they develop a model to forecast demand. From the
eight concerts, data were collected on the number of T-shirts sold or the estimate of what
could have been had there been sufficient numbers (DEMAND), enrollment at the
university's main campus (E), and advanced ticket sales (T). The data are shown here; E
is in thousands, and T is in hundreds.


DEMAND   E      T
220      12.0   12.0
520      32.0   25.0
800      47.0   42.0
450      24.0   22.0
375      22.6   18.5
250      15.0   12.9
350      21.0   17.5
650      39.0   32.5


a. Does it appear that the model has potential to accurately predict demand? 

b. What would happen to demand if E was 2,000 higher, assuming T remains constant? 
c. What would happen to demand if T was 50 higher, assuming E is held steady?

A specialty store emphasizing fashions for the successful businessperson is trying to
identify variables that can explain the level of purchases by customers. Data are collected
for dollar sales each visit (S), income of customer (I) in thousands, and years of
experience the customer had with his or her present employer (E). It was felt that the last
variable measures how high the customer has risen on the business success ladder and
thus reflects his or her need for fine attire.


S I E 
630 107 12
550 95 12 
320 54 6 
820 141 16 
450 76 10 
755 130 15 
750 127 14 
330 55 9 
1,020 174 2 
655 110 13 
420 127 8 
545 94 W 
740 125 15 


After collecting the data, the store manager finds himself to be statistically illiterate and
has no idea what to do with them. How can you help?


40. Based on the results of the previous problem, if the manager has the opportunity to serve a
customer with one more year's experience or one with $1,000 more in income, which
should he accommodate?
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41. Does your answer to the previous probem identify that variable which is more important 
in explaining sales? 

42. A computer run produces the following printout with a few of the important statistics
omitted. Complete the printout by providing the missing values. Identify the significant
variables and their respective p-values.


Multiple R            .997
R Square              .995
Adjusted R Square
Standard Error      26.040

Analysis of Variance
              DF    Sum of Squares    Mean Square
Regression     2        594595.268     297297.634
Residual       4          2712.446        678.111

F =                 Signif F = .0000

Variable          B       SE B     Beta         T    Sig T
X2           -3.867       .693    -.204              .0051
X1            9.901       .342    1.060              .0000
(Constant)  957.828     55.435             17.278    .0001

End Block Number 1   All requested variables entered.

CORR VARI ALL.

Correlations:      Y         X1        X2

  Y            1.000     .979**      .214
  X1           .979**    1.000       .394
  X2            .214      .394      1.000

N of cases: 7     1-tailed Signif: * - .01   ** - .001

" . " is printed if a coefficient cannot be computed


43. Consider the printout from the previous problem. Perform all tests for multicollinearity.
State and defend your conclusions.


Empirical Exercises 

44. For years your teachers have told you that there is a link between studying and grades. 
From a survey of your fellow students willing to reveal their academic success, or lack 
thereof, collect data on grade point average (GPA), typical number of hours per day or
week spent studying, number of earned credit hours, and any other factor you think may
help explain grades. Using GPA as the dependent variable, develop a regression model 
based on these variables. Does it appear that the time-honored relationship between effort 
and grades, as well as the other variables you have chosen, does exist? 

45. From Federal Reserve publications or other appropriate sources, collect data on personal
consumption expenditures, the money supply, and some measures of income at the
national level. Does it appear, as economic theory attests, that consumption is a function
of income and the amount of liquid assets available in the economy?

46. Formulate a theoretical model using variables pertaining to your major or intended 
profession. If you are an economics major, a model such as that specified in Problem 45
would be appropriate. A finance major may want to investigate the relationship between 
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stock valuation, debt levels, and dividend rates. Be sure you use a multiple regressi, 
model. Collect the necessary data to analyze the OLS model. Does it support the
theoretical principles upon which the model was based?


MINI-CASE APPLICATIONS
1. John Baker owns a trucking firm that contracts with the local port authorities in Kansas
City to haul goods brought in by barge on the Missouri River. Baker picks up the cargo at
the docking area every morning and delivers it throughout the city. Since the amount of
cargo he is expected to haul varies considerably from one day to the next, it has been
Baker's practice to hire day-workers as the need arises and to rent trucks from a local firm
when his small fleet is insufficient. For some time now, Baker has wanted to devise a
scheme to better estimate the volume of cargo he finds waiting for him each morning, as
well as the number of workers and trucks he will need to complete the job. Baker's son,
Bill, has a degree in management and took several statistics courses. Bill feels that the
ability to handle cargo could best be explained by the number of day-workers hired and the
number of available trucks. John fears that his money might have been wasted. It doesn't
take a college degree to figure out that the amount of goods John can carry depends on the
resources he has ready for use. Besides, this still leaves the problem of estimating the cargo
that awaits on the dock for citywide dispersal each morning.

It is here that young Bill impresses his father. Perhaps the morning cargo could best be
estimated by noting the shipping tonnage that passed through the lock upriver the previous
day. Part of that tonnage will be off-loaded on Baker's dock. In addition, the amount of
cargo might also depend on the number of dock workers available to unload it.

With this in mind, the younger Baker collects data for these variables. Twenty days are
randomly selected from the past six months. Bill records the tons transported by his
father's trucking company (TT), as well as the number of day-workers (DW) and trucks
(TR) employed by his father on those days. A second random sample of 15 days is taken
from the same six-month period. The tonnage that passed upriver the previous day (TON),
the number of dock workers (DOCK), and the volume of cargo off-loaded and awaiting
transportation (CARGO) are recorded. These data are shown here:


TT DW TR TON DOCK CARGO 
52 25 7 60 7 42 
7 24 8 67 8 45 
37 19 5 89 13 60 
68 33 9 98 15 6 
59 28 8 90 13 62 
63 30 8 4 11 5A | 
4 36 10 69 o. 48 
55 24 7 74 "1 55 
36 16 5 81 12 56 
49 24 7 ral 11 50 
1 34 9 BA 13 bed 
65 31 9 73 12 52 
60 28 8 94 14 65 
A a : 14 12 52 
a s ? 86 13 58 
"n 32 9 

45 22 6 

51 24 7 

47 2 5 


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a. Develop the two OLS models described above and determine if Bill has assisted his 
father in planning for his business operations. 

b. If DOCK = 14 and TON = 72, what is the amount of CARGO the Bakers can expect? 
c. Given the values in part b, how many day-workers will John need if he has seven
trucks?


APPENDIX

It may be beneficial to examine these hand calculations that lead to a solution for the regression
equation. In so doing, you can gain an appreciation for exactly what calculations must be
performed to complete a regression program, thus acquiring further insight into the nature of
regression analysis. The data needed to solve for the Hop Scotch equations are shown in Table
14-7.
Table 14-7  Computations for Hop Scotch Airlines

  Y     X1     X2      X1X2    X1²      X2²        X1Y     X2Y
 15     10    2.40    24.00    100     5.7600     150    36.00
 17     12    2.72    32.64    144     7.3984     204    46.24
 13      8    2.08    16.64     64     4.3264     104    27.04
 23     17    3.68    62.56    289    13.5424     391    84.64
 16     10    2.56    25.60    100     6.5536     160    40.96
 21     15    3.36    50.40    225    11.2896     315    70.56
 14     10    2.24    22.40    100     5.0176     140    31.36
 20     14    3.20    44.80    196    10.2400     280    64.00
 24     19    3.84    72.96    361    14.7456     456    92.16
 17     10    2.72    27.20    100     7.3984     170    46.24
 16     11    2.07    22.77    121     4.2849     176    33.12
 18     13    2.33    30.29    169     5.4289     234    41.94
 23     16    2.98    47.68    256     8.8804     368    68.54
 15     10    1.94    19.40    100     3.7636     150    29.10
 16     12    2.17    26.04    144     4.7089     192    34.72

268    187   40.29   525.38  2,469   113.3387   3,490   746.62


Using these data, Ace can then solve the system of normal equations for the intercept $b_0$
and coefficients $b_1$ and $b_2$. This is done by eliminating any one of them from the equations.
This involves several steps, outlined below. For instance, suppose Ace decides to eliminate $b_0$.
Step 1 requires that he get the coefficient of $b_0$ to be the same in any pair of equations. But first,
let us restate the normal equations from (14.8) to (14.7).

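$$\begin{aligned}
\Sigma Y &= nb_0 + b_1\Sigma X_1 + b_2\Sigma X_2 \\
\Sigma X_1Y &= b_0\Sigma X_1 + b_1\Sigma X_1^2 + b_2\Sigma X_1X_2 \\
\Sigma X_2Y &= b_0\Sigma X_2 + b_1\Sigma X_1X_2 + b_2\Sigma X_2^2
\end{aligned}$$

For the data in Table 14-2, these equations become

$$\begin{aligned}
268 &= 15b_0 + 187b_1 + 40.29b_2 && [1] \\
3{,}490 &= 187b_0 + 2{,}469b_1 + 525.38b_2 && [2] \\
746.62 &= 40.29b_0 + 525.38b_1 + 113.34b_2 && [3]
\end{aligned}$$

Step 1: Eliminate $b_0$ using [1] and [2].
Multiply [1] by 187 to obtain [4] and multiply [2] by 15 to obtain [5]. Then subtract [5] from [4].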


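$$\begin{aligned}
187 \times [1]: \quad 50{,}116 &= 2{,}805b_0 + 34{,}969b_1 + 7{,}534.23b_2 && [4] \\
15 \times [2]: \quad 52{,}350 &= 2{,}805b_0 + 37{,}035b_1 + 7{,}880.7b_2 && [5] \\
[4] - [5]: \quad -2{,}234 &= -2{,}066b_1 - 346.47b_2 && [6]
\end{aligned}$$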


Step 2: Eliminate $b_0$ using [2] and [3] (or [1] and [3]).
Multiply [2] by 40.29 to obtain [7] and multiply [3] by 187 to obtain [8]. Then subtract [8] from [7].

$$\begin{aligned}
40.29 \times [2]: \quad 140{,}612.1 &= 7{,}534.23b_0 + 99{,}476.01b_1 + 21{,}167.560b_2 && [7] \\
187 \times [3]: \quad 139{,}617.94 &= 7{,}534.23b_0 + 98{,}246.06b_1 + 21{,}194.58b_2 && [8] \\
[7] - [8]: \quad 994.16 &= 1{,}229.95b_1 - 27.0198b_2 && [9]
\end{aligned}$$


Step 3: Use [6] and [9] to eliminate $b_1$.
Multiply [6] by 1,229.95 to obtain [10] and multiply [9] by 2,066 to obtain [11]. Then add [10]
and [11].

$$\begin{aligned}
1{,}229.95 \times [6]: \quad -2{,}747{,}708.3 &= -2{,}541{,}076.7b_1 - 426{,}140.7765b_2 && [10] \\
2{,}066 \times [9]: \quad 2{,}053{,}934.56 &= 2{,}541{,}076.7b_1 - 55{,}822.9068b_2 && [11] \\
[10] + [11]: \quad -693{,}773.74 &= -481{,}963.6833b_2 && [12]
\end{aligned}$$


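Step 4: Obtain values for $b_0$, $b_1$, and $b_2$.
From [12], $b_2 = 1.439473064$.

From [6],

$$\begin{aligned}
-2{,}234 &= -2{,}066b_1 - 346.47(1.439473064) \\
-2{,}234 &= -2{,}066b_1 - 498.7342325 \\
b_1 &= 0.839915666
\end{aligned}$$

From [1],

$$\begin{aligned}
268 &= 15b_0 + 187(0.839915666) + 40.29(1.439473064) \\
b_0 &= 3.529293371
\end{aligned}$$

Thus,

$$\begin{aligned}
b_0 &= 3.529293371 \approx 3.53 \\
b_1 &= 0.839915666 \approx 0.840 \\
b_2 &= 1.439473064 \approx 1.44
\end{aligned}$$

and

$$\hat{Y} = 3.53 + 0.84X_1 + 1.44X_2$$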
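The elimination steps above can be checked mechanically. The following is a minimal sketch, not part of the original appendix: it solves the three normal equations [1] to [3] by Gaussian elimination in plain Python. The function name `solve_3x3` and the variable names are illustrative assumptions; the numbers are the column totals used in the text.

```python
def solve_3x3(a, b):
    """Solve a 3x3 linear system by Gaussian elimination with partial pivoting.

    a is a list of three coefficient rows, b the list of right-hand sides.
    (Hypothetical helper written for this illustration.)
    """
    n = 3
    # Build an augmented matrix so the caller's lists are left untouched.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        # Swap in the row with the largest pivot to limit rounding error.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back-substitution, last unknown first.
    x = [0.0] * n
    for r in reversed(range(n)):
        known = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - known) / m[r][r]
    return x


# Coefficients of b0, b1, b2 in equations [1], [2], [3] of the text.
normal_matrix = [
    [15.0, 187.0, 40.29],
    [187.0, 2469.0, 525.38],
    [40.29, 525.38, 113.34],
]
totals = [268.0, 3490.0, 746.62]

b0, b1, b2 = solve_3x3(normal_matrix, totals)
print(round(b0, 2), round(b1, 3), round(b2, 2))
```

The solver uses the same rounded value 113.34 for the sum of squared $X_2$ that the hand calculation uses, so its answers should agree with Step 4 to several decimal places.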


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Calculations for the standard error of the estimate also reveal what this statistic measures- 


Pass    Adv.    NI      Predicted Y    Error       Error Squared
 Y       X1     X2          Ŷ          Y − Ŷ        (Y − Ŷ)²

15      10     2.40      15.383      -0.38338      0.14698
17      12     2.72      17.524      -0.52382      0.27438
13       8     2.08      13.243      -0.24294      0.05902
23      17     3.68      23.105      -0.10547      0.01112
16      10     2.56      15.614       0.38607      0.14905
21      15     3.36      20.965       0.03497      0.00122
14      10     2.24      15.153      -1.15282      1.32900
20      14     3.20      19.895       0.10519      0.01106
24      19     3.84      25.015      -1.01536      1.03095
17      10     2.72      15.844       1.15551      1.33521
16      11     2.07      15.748       0.25248      0.06375
18      13     2.33      17.802       0.19850      0.03940
23      16     2.98      21.257       1.74287      3.03761
15      10     1.94      14.721       0.27947      0.07810
16      12     2.17      16.731      -0.73128      0.53477
                                      8.10162 = Σ(Y − Ŷ)²
Then 
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Variables and 
Residual 
Analysis: 
Extensions of 
Regression and 
Correlation 


A Preview of Things to Look For 

1. How dummy variables can be used to incorporate qualitative measures in a
regression model.

2. The ways to use residual analysis to evaluate a model.

3. The detection and impact of autocorrelation.

4. The role played by the Durbin-Watson statistic in residual analysis.

5. The use of logarithms and polynomials to develop curvilinear models.
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CHAPTER BLUEPRINT 


Chapters 13 and 14 thoroughly described the many ways regression and 
correlation can be used to develop solutions to common business problems. 
This chapter offers a few finishing touches to our study of regression and 
correlation. It explores some important features not examined earlier. 
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Statistics show that it’s often easier to do something right 
than to explain why you did it wrong. 


INTRODUCTION 


By now you should be well aware of the power and versatility of regression and
correlation analysis. Throughout the last two chapters we have witnessed how these
vital tools of statistical analysis can address many business problems.

However, we have not yet examined several pertinent factors related to regression
and correlation. We will now explore these statistical features that greatly enhance the
usefulness of these two powerful instruments of statistical inquiry. In addition, we will
investigate ways to evaluate and improve regression models. Our attention will focus
primarily on

• Dummy variables.
• Autocorrelation.
• Residual analysis.
• Heteroscedasticity.
• Curvilinear analysis.


DUMMY VARIABLES


In your research efforts you may find many variables that are useful in explaining the 
value of the dependent variable. For example, years of education, training, and 
experience are instrumental in determining the level of a person’s income. These 
variables can be easily measured numerically, and readily lend themselves to statisti- 
cal analysis. 

However, such is not the case with many other variables that are also useful in
explaining income levels. Studies have shown that gender and geography also carry
considerable explanatory power. A woman with the same number of years of education
and training as a man will not have the same income. A worker in the Northeast
may not earn the same as a worker in the South doing a similar job. Both gender and
geography can prove to be highly useful explanatory variables in the effort to predict
one's income. However, neither variable can readily be expressed numerically, and
cannot be directly included in a regression model. We must therefore modify the form
of these nonnumeric variables so we can include them in our model and thereby gain
the additional explanatory power they offer.

Variables that are not expressed in a direct, quantitative fashion are called 
qualitative variables or dummy variables. As another illustration, the sales of a firm 
may depend on the season. Swimwear probably sells better in the spring than it does in 
the fall or winter. More snow shovels are sold in December than in July. This seasonal 


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factor can only be captured by taking into account the time of year (fall, winter, spring, 
or summer), a variable that cannot be measured numerically. Whether a person is 
married, single, or divorced may affect his or her expenditures for recreational 
purposes, while one's place of residence (urban, suburban, or rural) will likely impact
on their tax assessment. In all these cases, the variables we wish to measure cannot 
readily be expressed numerically. We must use dummy variables to obtain a more 
complete description of the impact of these nonnumeric measures. 


As the regional manager for a department store chain, you wish to study the 
relationship between mean expenditures by your customers and those variables you 
feel might explain the level of those expenditures. In addition to the logical choice of 
income as an explanatory variable, you feel that a customer’s sex may also play a part 
in explaining expenditures. You therefore collect 15 observations for these three 
variables: expenditures in dollars, income in dollars, and sex. 

But how do you encode the data for sex into the model? You cannot simply 
specify M or F for male and female, because these letters cannot be manipulated 
mathematically. The solution is found by assigning values of 0 or 1 to each observa- 
tion based on sex. You might, for example, choose to record a 0 if the observation is 
male and a 1 if the observation is female. The reverse is equally likely. You could just 
as well encode a 0 if female and a 1 if male. (We will examine the effects of this 
alternate coding scheme shortly.) 

Suppose you chose to record a 0 if the observation is male and a 1 if it is female.
The complete data set for n = 15 observations is shown in Table 15-1, with Y in
dollars and $X_1$ in units of $1,000. Notice that $X_2$ contains only values of 0 for male and
1 for female.
Table 15-1  Data for Study of Customer's Expenditures

Observation   Expenditures (Y)   Income (X1)   Sex (X2)
     1               51                            0
     2               30               25           0
     3               32                7           0
     4               45               32           1
     5               51               45           1
     6               31               29           0
     7               50               42           1
     8               47               38           1
     9               45               30           0
    10               39               29           1
    11               50                4           1
    12               35               23           1
    13               40               36           0
    14               45               42           0
    15               50               48           0
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Using the OLS procedures discussed in the previous two chapters, the regressio 
equation is 
$$\begin{aligned}
\hat{Y} &= b_0 + b_1X_1 + b_2X_2 \\
&= 12.21 + \underset{(0.000)}{0.791}\,X_1 + \underset{(0.010)}{5.11}\,X_2
\end{aligned}$$

The p-values are shown in parentheses.

The use of a dummy variable for sex will actually produce two regression lines,
one for males and one for females. These lines have the same slope but different
intercepts. In other words, the equation gives two parallel regression lines that start at
different points on the vertical axis. Since we encoded a 0 for males, the equation
becomes

$$\begin{aligned}
\hat{Y} &= b_0 + b_1X_1 + b_2X_2 \\
&= 12.21 + 0.791X_1 + 5.11(0) \\
&= 12.21 + 0.791X_1
\end{aligned}$$

for males. This line has an intercept of 12.21 and a slope of 0.791, and is shown in
Figure 15-1.
Figure 15-1  Regression Lines for Customers. Expenditures (vertical axis) plotted against income (horizontal axis): $\hat{Y} = 17.32 + 0.791X_1$ for females and $\hat{Y} = 12.21 + 0.791X_1$ for males, with intercepts of 17.32 and 12.21.


For females, the encoded value of 1 produces

$$\begin{aligned}
\hat{Y} &= 12.21 + 0.791X_1 + 5.11(1) \\
&= 17.32 + 0.791X_1
\end{aligned}$$

This second line has the same slope as the line for males, but has an intercept of
17.32. Since $X_2 = 1$ for females, the intercept was determined as $b_0 + b_2 = 12.21 +
5.11 = 17.32$.


This means that for any given level of income, women customers spend $5.11
more on the average than do men. Let income equal 30 ($30,000). Then for women

$$\begin{aligned}
\hat{Y} &= 12.21 + 0.791(30) + 5.11(1) \\
&= 41.05
\end{aligned}$$
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and for men 


$$\begin{aligned}
\hat{Y} &= 12.21 + 0.791(30) + 5.11(0) \\
&= 35.94
\end{aligned}$$


The difference of $5.11 occurs because the encoded value of 0 for males cancels out
the $b_2$ coefficient of 5.11, while the encoded value of 1 for females results in the
addition of 5.11 to the equation.
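The two parallel lines can also be evaluated directly from the fitted equation. The sketch below is only an illustration using the coefficients reported in the text; the function name `predicted_expenditure` and its arguments are assumptions introduced here, with the dummy coded 0 for male and 1 for female as in Table 15-1.

```python
# Coefficients reported in the text for Y-hat = b0 + b1*X1 + b2*X2.
B0, B1, B2 = 12.21, 0.791, 5.11


def predicted_expenditure(income_thousands, is_female):
    """Return predicted expenditures for a customer.

    income_thousands: income in units of $1,000 (X1).
    is_female: True encodes the dummy X2 as 1, False encodes it as 0.
    """
    sex_dummy = 1 if is_female else 0
    return B0 + B1 * income_thousands + B2 * sex_dummy


women = predicted_expenditure(30, is_female=True)
men = predicted_expenditure(30, is_female=False)
print(round(women, 2), round(men, 2), round(women - men, 2))
```

Whatever income is supplied, the gap between the two predictions equals the dummy coefficient, which is why the lines are parallel.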

The p-value of 0.010 tells you that the coefficient of 5.11 for sex is significant at
the 1 percent level. However, if the p-value were not given, you should test the
hypothesis that it differs significantly from zero. That is,

$$H_0\colon \beta_2 = 0 \qquad H_A\colon \beta_2 \neq 0$$

Using SPSS-PC, the standard error of $b_2$ was estimated to be

$$s_{b_2} = 1.672$$

Then

$$t = \frac{b_2}{s_{b_2}} = \frac{5.11}{1.672} = 3.06$$

If $\alpha = 0.05$, $t_{0.05,\,n-k-1} = t_{0.05,\,12} = 2.179$.


DECISION RULE  Do not reject if $-2.179 < t < 2.179$. Reject if $t < -2.179$
or $t > 2.179$.


The t-value of 3.06 results in a rejection of the null. It is concluded at the 95 percent
level of confidence that a significant difference exists between expenditures by men
and women customers.
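The test just described reduces to one division and one comparison. As a small sketch under stated assumptions, the helper names `t_statistic` and `reject_null` below are hypothetical; the critical value 2.179 is taken from the text rather than computed.

```python
def t_statistic(coefficient, standard_error):
    """t ratio for testing whether a regression coefficient differs from zero."""
    return coefficient / standard_error


def reject_null(t_value, critical_value):
    """Two-tailed decision rule: reject when t falls outside +/- the critical value."""
    return abs(t_value) > critical_value


t_value = t_statistic(5.11, 1.672)          # b2 and its standard error from the text
decision = reject_null(t_value, 2.179)      # critical t for 12 degrees of freedom
print(round(t_value, 2), decision)
```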

If you had encoded the dummy variable by assigning a 1 for a male observation 
and a 0 for a female observation, the results would be the same. A computer run shows 
the equation to be 


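$$\hat{Y} = 17.32 + 0.791X_1 - 5.11X_2$$

For females, we have

$$\begin{aligned}
\hat{Y} &= 17.32 + 0.791X_1 - 5.11(0) \\
&= 17.32 + 0.791X_1
\end{aligned}$$

and for males

$$\begin{aligned}
\hat{Y} &= 17.32 + 0.791X_1 - 5.11(1) \\
&= 12.21 + 0.791X_1
\end{aligned}$$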


Encoding the dummy variable either way yields the same results. 
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: 
Figure 15-2  Scatter Diagram for Expenditures (separate clusters of points for females and for males)


If the data were put into a scatter diagram, they might appear as in Figure 15-2. In 
an extreme case, there could appear two almost totally separate diagrams, one for the 
male observations and one for the females. If the dummy variable was ignored and 
only one line was fitted, its slope would be much steeper than the other two, such as 
the line identified as Y*. The effect attributed to income alone by the single regression 
line should be partially ascribed to sex. 

If a dummy variable has more than two possible responses, you cannot encode it
as 0, 1, 2, 3, and so on. A variable with r possible responses will be expanded to
encompass a total of r − 1 variables. For example, you might include a third variable
in your model to study the effect of marital status on expenditures. Your possible
responses might include married, single, divorced, and widowed. In addition to $X_1$ for
income and $X_2$ for sex, these four possible responses require three additional variables,
$X_3$, $X_4$, and $X_5$, to encode the data on marital status. This is done by entering only a 0
or a 1 for each variable in the following manner:

X3 = 1 if married
   = 0 if not married
X4 = 1 if single
   = 0 if not single
X5 = 1 if divorced
   = 0 if not divorced

No entry for widowed is necessary, because if $X_3 = X_4 = X_5 = 0$, the process of
elimination reveals the observation to be widowed.

Assume 0 is encoded for male and 1 for female in $X_2$. The three observations
(OBS) shown here are for (1) a married male with expenditures of 30 and income of
40, (2) a divorced female with expenditures of 35 and income of 38, and (3) a
widowed male with expenditures of 20 and income of 45.

OBS    Y     X1    X2    X3    X4    X5
 1     30    40    0     1     0     0
 2     35    38    1     0     0     1
 3     20    45    0     0     0     0
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For example, in the first observation, X, would be 0 since the observation is male, and 
$X_3$ is 1, while both $X_4$ and $X_5$ are 0 since the observation is married.
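The r − 1 coding rule can be written as a short routine. The following sketch is illustrative only: `encode_dummies`, `encode_row`, and the category list are names assumed here, and the last category in the list (widowed) is treated as the omitted base, matching the scheme described above.

```python
MARITAL_STATUSES = ["married", "single", "divorced", "widowed"]


def encode_dummies(value, categories):
    """Return r - 1 zero/one indicators for a variable with r categories.

    The final entry of `categories` is the base case and gets no column,
    so it is identified by all indicators being 0.
    """
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if value == category else 0 for category in categories[:-1]]


def encode_row(expenditures, income, sex, marital_status):
    """Build one observation as [Y, X1, X2, X3, X4, X5]."""
    sex_dummy = 1 if sex == "female" else 0   # 0 = male, 1 = female
    return [expenditures, income, sex_dummy] + encode_dummies(
        marital_status, MARITAL_STATUSES
    )


observations = [
    encode_row(30, 40, "male", "married"),
    encode_row(35, 38, "female", "divorced"),
    encode_row(20, 45, "male", "widowed"),
]
for row in observations:
    print(row)
```

Dropping one category keeps the dummy columns from summing to the constant term, which is the reason only r − 1 variables are entered.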


15.2.1  A salesman wishes to regress sales (Y) on (1) advertising and (2) the state in which
the sales were made. His territory covers Iowa, Indiana, Ohio, and Illinois.

a. How many dummy variables will he need?
b. Identify them, using $X_i$.
c. How would you encode an observation of 20 in sales in Iowa, 30 in Ohio, 40 in
Illinois?

Answer: a. 3
b. One possibility is $X_1$ = advertising, $X_2$ = Iowa, $X_3$ = Indiana, $X_4$ = Ohio.
c. Under that coding, Iowa: Y = 20, $X_2 = 1$, $X_3 = 0$, $X_4 = 0$; Ohio: Y = 30, $X_2 = 0$,
$X_3 = 0$, $X_4 = 1$; Illinois: Y = 40, $X_2 = X_3 = X_4 = 0$.


15.2.2  If the equation from 15.2.1 proved to be $\hat{Y} = 20 + 40X_1 + 20X_2 + 30X_3 - 20X_4$,
what would sales be
a. in Iowa if $X_1 = 10$?
b. in Illinois if $X_1 = 5$?
Answer: a. $\hat{Y} = 20 + 40(10) + 20(1) + 30(0) - 20(0) = 440$
b. $\hat{Y} = 20 + 40(5) + 20(0) + 30(0) - 20(0) = 220$


regression model. As noted earlier, a good regression exhibits purely random errors
that are normally distributed with a mean of 0 and a variance of $\sigma^2$. If an examination
of these residuals reveals conditions to the contrary, this may suggest that there are
problems inherent in the model. The detection of any pattern of correlation in the error
terms could mean that some of the basic assumptions regarding the OLS model are
being violated. The remainder of this chapter is devoted to an examination of error
terms, and to an analysis of the problems that may be detected from such an
examination. We concentrate primarily on the principles of autocorrelation and
heteroscedasticity.


Autocorrelation 


One of the basic properties of the OLS model is that the errors be uncorrelated. The
error in prediction that you suffer at one point in time is not linearly related to the error
that you might suffer at another point. Ideally, if you were to graph your errors over
time, they should appear as in Figure 15-3. There is no detectable pattern in the errors.
The error terms appear to be independent and offer no indication of any relationship
with each other.
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FIGURE 1 


Absence of 
Autocorrelation 


However, as noted in Chapter 13, when dealing with time-series data this condition
is often violated. We find that errors can be correlated, resulting in autocorrelation
(AC). Over time, many economic series such as unemployment, GNP, or interest
rates move cyclically. If a series is unusually low (high) relative to its long-run mean
in one month, it is probably still low (high) in the next month. Corrections are simply
not made overnight. A regression model is based on a series' long-run average. If a
series is unusually low, the regression model will likely overestimate its value. This
overestimate will produce a negative error since $e_t = Y_t - \hat{Y}_t$. Since the series is
probably still unusually low the next time period, another negative error is to be
expected. The reverse is true when the series cycles to an unusually high level.
Positive errors will be generated for several successive time periods. This pattern of
successive negative errors, followed by several positive errors, is evidence of
autocorrelation.

Figure 15-4 illustrates autocorrelation. There is a distinct pattern in the error
terms. Several successive negative errors initiate the pattern, followed by several
positive errors, and, in turn, several more negative errors. (In practice, don't expect the
pattern to be so obvious.)


Figure 15-4  Presence of Positive Autocorrelation


The correlation between error terms can be measured just like the correlation
between any two variables in the model. The correlation between an error in one time
period t and the previous time period, t − 1, is expressed as $\rho_{e_t,e_{t-1}}$, where the
parameter $\rho$ is the population correlation coefficient for error terms. Like all parameters,
it is estimated by its corresponding statistic using the sample data. This correlation
between errors at the sample level is measured by r, the same sample correlation
coefficient that we have used in measuring correlation between any two variables in
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our model. In Figure 15-3, where AC is absent, we would estimate the correlation 
among error terms using $r_{e_t,e_{t-1}}$ to be zero. However, Figure 15-4 suggests that an
error is likely to be followed by another error of the same sign. Thus, $r_{e_t,e_{t-1}} > 0$. There
is said to be positive AC. If the errors tended to alternate in sign as in Figure 15-5,
negative correlation would be present, and $r_{e_t,e_{t-1}} < 0$.
Figure 15-5  Presence of Negative Autocorrelation (errors plotted against time, by month)

In the presence of AC, all the hypothesis tests and confidence intervals that we 
examined in Chapters 13 and 14 are rendered less reliable. This makes autocorrelation 
an extremely harmful condition. 

You can build a model to examine the error in your original model. If $\varepsilon_t$ is the
error, then the model relating the error in one time period to that in the next is

$$\varepsilon_t = \rho\,\varepsilon_{t-1} + \mu_t \qquad [15.1]$$

where $\rho$ is the correlation between errors in the original model, and $\mu_t$ is the random
error term in the prediction of those errors; that is, it measures the error we suffer in
trying to estimate the error in our original model. The term $\mu_t$, often called white
noise, occurs because errors in the original model are not perfectly correlated. There
will therefore be some error in our attempt to predict the error in our original model.
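To make Formula (15.1) concrete, the sketch below generates artificial errors that follow it and then measures the sample correlation between each error and the one before it. Everything here is an illustrative assumption rather than data from the text: the value 0.8 for the correlation parameter, the standard normal white noise, the seed, and the function names.

```python
import random


def simulate_ar1_errors(rho, n, seed=0):
    """Generate n errors where each error is rho times the previous one plus white noise."""
    rng = random.Random(seed)
    errors = [rng.gauss(0.0, 1.0)]
    for _ in range(n - 1):
        errors.append(rho * errors[-1] + rng.gauss(0.0, 1.0))
    return errors


def lag1_correlation(errors):
    """Sample correlation r between e_t and e_(t-1)."""
    previous, current = errors[:-1], errors[1:]
    mean_prev = sum(previous) / len(previous)
    mean_curr = sum(current) / len(current)
    covariance = sum((p - mean_prev) * (c - mean_curr) for p, c in zip(previous, current))
    var_prev = sum((p - mean_prev) ** 2 for p in previous)
    var_curr = sum((c - mean_curr) ** 2 for c in current)
    return covariance / (var_prev * var_curr) ** 0.5


positively_correlated = simulate_ar1_errors(rho=0.8, n=2000, seed=1)
print(lag1_correlation(positively_correlated))      # expected to be well above zero
print(lag1_correlation([1, -1, 1, -1, 1]))          # alternating signs: negative AC
```

An error series that alternates in sign, like the second call, is the situation the text describes as negative autocorrelation.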

Figure 15-6 also depicts error patterns that can reveal information about the
model by plotting $e_t$ against $e_{t-1}$.


Figure 15-6  Detection of Autocorrelation




In Figure 15-6(a), positive autocorrelation is present because where $e_t$ is positive
$e_{t-1}$ is also positive, and where $e_t$ is negative $e_{t-1}$ is also negative. Consecutive errors
have the same sign. The error relationship is contained in the two positive quadrants of
the axes. This would suggest $\rho_{e_t e_{t-1}} > 0$.

In Figure 15-6(b) the error terms are restricted to the two negative quadrants,
evidencing negative correlation. That is, $e_t$ and $e_{t-1}$ tend to take on opposite signs,
indicating $\rho_{e_t e_{t-1}} < 0$.

Although analyzing errors can be a means of detecting autocorrelation, it is not
very reliable. Patterns are seldom as obvious as suggested here. We require a less
fallible procedure, and find one based on the Durbin-Watson statistic d. The
Durbin-Watson statistic is used to test the hypothesis of no autocorrelation:


$H_0$: $\rho_{e_t e_{t-1}} = 0$    No autocorrelation
$H_A$: $\rho_{e_t e_{t-1}} \neq 0$    Autocorrelation present


It is calculated as


$d = \frac{\sum (e_t - e_{t-1})^2}{\sum e_t^2}$ [15.2]


Using our data from the study on consumer expenditures, Table 15-2 provides the
necessary calculations. Note that $0 \le d \le 4$. As a general rule, if d is close to 2,
assume that autocorrelation is not a problem. However, it is advisable to determine if
the value actually found using Formula (15.2) is significant by testing the hypothesis
that $\rho = 0$.


The critical values to which we will compare d = 2.03 are found using two
values: the number of independent variables, k, and the number of observations, n. In
our example, k = 2 and n = 15. If $\alpha = 0.05$, Table K gives $d_L = 0.95$ and $d_U = 1.54$.
A simple scale can now be constructed, as in Figure 15-7, to determine if the null
hypothesis of no autocorrelation is rejected or not rejected.


FIGURE 15-7 Durbin-Watson Test Statistic

  +AC  |  Test is inconclusive  |  No AC  |  Test is inconclusive  |  -AC
  0    d_L = 0.95    d_U = 1.54    2    4 - d_U = 2.46    4 - d_L = 3.05    4

TABLE 15-2 Calculation of the Durbin-Watson Statistic

Obs.   $Y_t$   $\hat{Y}_t$   $e_t^2$    $e_t - e_{t-1}$   $(e_t - e_{t-1})^2$
  1     51    49.3359    2.76912
  2     30    30.3784    0.14318    -2.04246     4.1716
  3     32    32.1138    0.01295     0.26459     0.0700
  4     45    42.3943    6.78980     2.71953     7.3958
  5     51    53.6745    7.15280    -5.28020    27.8805
  6     31    32.9815    3.92638     0.69296     0.4802
  7     50    51.0714    1.14779     0.91016     0.8264
  8     47    46.7328    0.07139     1.33854     1.7917
  9     45    42.5263    6.11922     2.20652     4.8687
 10     39    39.7912    0.62592    -3.26485    10.6593
 11     50    50.2180    0.04752     1.00915     1.01838
 12     35    34.5940    0.16483     0.18800     0.03534
 13     40    39.9380    0.00384     0.34400     0.11834
 14     45    45.1460    0.02132    -0.20800     0.04326
 15     50    50.3540    0.12532    -0.20800     0.04326
Totals                  29.12203                59.40478


If $d_U < d < 4 - d_U$, there is no evidence of autocorrelation and the null is not
rejected; $d < d_L$ evidences positive AC; $d > 4 - d_L$ indicates negative AC. The two
inconclusive regions arise because the distribution of d depends on the characteristics
of the interrelationships among the independent variables. No generalization of these
characteristics can be broad enough to unambiguously restrict the d-value.

We calculated d to be 2.03, so the null is not rejected. It would appear that 
correlation between error terms is not a problem. 
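
The decision rule for the Durbin-Watson scale can be written as a small function. This is a minimal sketch; the function name and the wording of the returned labels are illustrative choices, and the critical values must still be looked up in Table K.

```python
def durbin_watson_decision(d, d_lower, d_upper):
    """Classify a Durbin-Watson statistic using the scale described in the text.

    d_lower and d_upper are the tabulated critical values d_L and d_U.
    """
    if not 0.0 <= d <= 4.0:
        raise ValueError("the Durbin-Watson statistic must lie between 0 and 4")
    if d < d_lower:
        return "positive autocorrelation"
    if d > 4.0 - d_lower:
        return "negative autocorrelation"
    if d_upper < d < 4.0 - d_upper:
        return "no autocorrelation"
    return "inconclusive"


# Consumer expenditure example: d = 2.03 with d_L = 0.95 and d_U = 1.54.
print(durbin_watson_decision(2.03, 0.95, 1.54))
```

For the consumer expenditure example the function returns "no autocorrelation", which matches the conclusion reached above.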

The calculations are rather tedious. They can be simplified by estimating the
value for d as


$d = 2(1 - r)$ [15.3]


where r is the correlation coefficient between $e_t$ and $e_{t-1}$. Still, if the computations
must be done by hand, a good deal of arithmetic is necessary. Luckily, most computer
programs report the Durbin-Watson value.
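
Both Formula (15.2) and the shortcut in Formula (15.3) are easy to compute once the residuals are available. The sketch below uses residuals implied by Table 15-2, rounded to four decimals, with signs inferred from the table's difference column; treat that list as an assumption of this illustration rather than as data printed in the text. The function names are also illustrative.

```python
import math

# Residuals implied by Table 15-2 (rounded; signs inferred from the
# e_t - e_{t-1} column). This list is an assumption of the illustration.
residuals = [1.6641, -0.3784, -0.1138, 2.6057, -2.6745, -1.9815, -1.0714,
             0.2672, 2.4737, -0.7912, 0.2180, 0.4060, 0.0620, -0.1460, -0.3540]


def durbin_watson(e):
    """Formula (15.2): sum of squared successive differences over sum of squares."""
    numerator = sum((cur - prev) ** 2 for prev, cur in zip(e, e[1:]))
    denominator = sum(x ** 2 for x in e)
    return numerator / denominator


def lag_correlation(e):
    """Pearson correlation between e_t and e_{t-1}."""
    x, y = e[:-1], e[1:]
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


d_exact = durbin_watson(residuals)
d_approx = 2 * (1 - lag_correlation(residuals))  # Formula (15.3)
print(f"d = {d_exact:.2f}, shortcut estimate = {d_approx:.2f}")
```

The exact statistic comes out close to the value of 2.03 used in the text, and the shortcut gives a similar but not identical figure, since Formula (15.3) is only an approximation in small samples.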


QUICK CHECK

15.3.1 In a model with k = 3 and n = 30, d is calculated to be 3.12. At the 5 percent level,
is autocorrelation present? Explain.
Answer: Yes. $d_L = 1.21$; $d_U = 1.65$; $3.12 > 4 - d_L = 2.79$. Negative AC is present.


15.3.2 If $r_{e_t e_{t-1}} = 0.75$, is AC present at the 1 percent level if k = 4 and n = 35? Explain.
Answer: $d = 2(1 - 0.75) = 0.5$; $d_L = 1.03$; $d_U = 1.51$; $0.5 < d_L$. Positive AC is
present.




Heteroscedasticity


In addition to any absence of correlation in errors, another basic property of the OLS
model is homoscedasticity. In Chapter 14 we defined homoscedasticity as a constant
variation in the error terms. The variation in errors experienced when X is equal to
some value, say 10, is the same as the variance in errors when X equals any other
value. In Figure 15-8(a), as shown by the two normal curves, the distribution of the
$Y_i$-values above and below the regression line is the same at X = 10 as it is at X = 11.
Thus, the errors that are represented by the difference between these Y-values and the
regression line are normally distributed with the same spread at both values of X. This
indicates the presence of homoscedasticity.


FIGURE 15-8 Distribution of Errors


If the variance in errors is not the same for all values of X, heteroscedasticity 
occurs. Figure 15-8(b) shows that as X increases, the variance in error terms becomes 
more pronounced. The normal curve at X = 11 is more spread out than the curve at 
X = 10, indicating greater dispersion in the errors.

Heteroscedasticity is common with cross-sectional data. Cross-sectional data are 
often used, for example, in investigations of consumer spending habits. In such 
studies, data are typically collected for consumption and income across income levels 
that encompass the poor, the rich, and those in between. This constitutes a cross- 
sectional data set because it cuts across different income groups. As might be 
expected, the rich display a behavioral model with respect to their consumption 
pattern that is different from the rest of us. This difference causes a variation in error 
terms that evidences heteroscedasticity. 

In the presence of heteroscedasticity the regression coefficients become less 
efficient. That is, there is an increase in the variance of the b-values. The b-value 
obtained with one sample differs from that obtained with a different sample. In such a
case, it is difficult to place much faith in the regression coefficients.

Heteroscedasticity can often be detected by plotting the $\hat{Y}$-values against the error
terms. If any pattern is displayed, heteroscedasticity is likely present. Figures 15-9(a)
and 15-9(b) reveal possible patterns often encountered in the presence of
heteroscedasticity. Figure 15-9(c), however, does not suggest any detectable pattern;
heteroscedasticity appears to be absent.


FIGURE 15-9 A Check for Heteroscedasticity


If heteroscedasticity is suspected, the use of the generalized least-squares (GLS) 
method is recommended. A discussion of GLS can be found in more advanced texts. 

Although residual patterns are a good indication of heteroscedasticity, trying to 
read them is more of an art form than a scientific procedure. The patterns seldom 
cooperate by being as obvious as suggested above. We need more concrete methods of 
detecting heteroscedasticity. The remainder of this section presents common methods 
of identifying the presence of heteroscedasticity. 


WHITE'S TEST FOR HETEROSCEDASTICITY In 1980, Halbert White offered such a
method based on the $\chi^2$ distribution (see Chapter 12). His approach involves several
well-defined steps:


1. Run the original regression and obtain the error term for each observation.

2. Square the error terms to get $e^2$ and regress them on all independent variables, the
squares of all independent variables, and the cross-products of all independent
variables. If there were three independent variables $X_1$, $X_2$, and $X_3$, you must
regress $e^2$ on $X_1$, $X_2$, $X_3$, $X_1^2$, $X_2^2$, $X_3^2$, $X_1X_2$, $X_1X_3$, and $X_2X_3$. This regression model
is called the auxiliary model.

3. Compute $nR^2$, where n is the number of observations and $R^2$ is the unadjusted
coefficient of determination from the auxiliary equation.

4. If $nR^2 > \chi^2_{\alpha,k}$, where k is the number of explanatory terms in the auxiliary
equation, reject the null that error variances are equal and assume
heteroscedasticity exists.


Certain precautions must be observed in carrying out Step 2. Most notable for our
purpose is the danger involved if dummy variables are used in the model. If $X_i$ is a
dummy variable, then $X_i^2$ should not be included in the auxiliary equation because $X_i$
equals $X_i^2$ and perfect multicollinearity exists. In addition, the cross-product of two
dummy variables that code the same qualitative variable is also excluded, since it always equals zero.
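
Step 2 and its precautions can be sketched as a routine that lists the terms of the auxiliary equation. The function name and the string notation for squares and cross-products are illustrative assumptions. The sketch drops every cross-product of two dummy variables, which follows the text's advice and is appropriate when those dummies are mutually exclusive categories.

```python
from itertools import combinations


def auxiliary_terms(variables, dummies=()):
    """List the regressors for White's auxiliary equation.

    variables: names of all independent variables in the original model.
    dummies:   names of those variables that are 0/1 dummy variables.
    Squares of dummies are skipped (a dummy equals its own square), and
    cross-products of two dummies are skipped, as the text advises.
    """
    dummy_set = set(dummies)
    terms = list(variables)
    terms += [f"{v}^2" for v in variables if v not in dummy_set]
    terms += [f"{a}*{b}" for a, b in combinations(variables, 2)
              if not (a in dummy_set and b in dummy_set)]
    return terms


print(auxiliary_terms(["X1", "X2", "X3"]))
print(auxiliary_terms(["X1", "X2"], dummies=["X2"]))
```

The first call reproduces the nine terms listed in Step 2. The second covers a model with one quantitative variable and one dummy, and yields four terms.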
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We can use the data on consumer expenditures from Table 15-1 to illustrate. The 
auxiliary equation would regress $e^2$ on $X_1$, $X_2$, $X_1^2$, and $X_1X_2$. Notice that $X_2^2$ is
excluded since $X_2$ is a dummy variable. The results are

$e^2 = -0.715 + 0.22X_1 - 5.814X_2 - 0.004X_1^2 + 0.188X_1X_2$
                (0.867)     (0.482)      (0.808)        (0.421)

with $R^2 = 0.121$. The p-values are reported in parentheses. Here, $nR^2 = (15)(0.121) = 1.82$,
and $\chi^2_{0.05,4} = 9.488$. Since $nR^2 = 1.82 < 9.488$, we do not reject the null that the
error terms have equal variances, and we conclude that heteroscedasticity does not
exist.
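
Steps 3 and 4 of White's test reduce to a single comparison, sketched below. The function name is illustrative, and the critical value is passed in by hand because it must still be read from a chi-square table.

```python
def white_test(n, r_squared, chi2_critical):
    """Return the statistic n * R^2 and whether the null of equal variances is rejected."""
    statistic = n * r_squared
    return statistic, statistic > chi2_critical


# Consumer expenditure example: n = 15, R^2 = 0.121, critical value 9.488.
statistic, reject = white_test(15, 0.121, 9.488)
print(f"nR^2 = {statistic:.2f}, reject null: {reject}")
```

Here the statistic is well below the critical value, so the null of equal error variances is not rejected.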


THE CURVILINEAR CASE
Throughout our discussion, we have been assuming that the relationship between
X and Y can be expressed by a straight line. That is, the relationship is linear.
However, this is not always the case. Suppose we plotted our data in a scatter diagram and
found results like those in Figure 15-10. A straight line $\hat{Y} = a + bX$ would produce a
poor fit. The data in Figure 15-10 suggest a nonlinear (or curvilinear) exponential
relationship:

$\hat{Y} = b_0 b_1^X$ [15.4]


where $b_0$ and $b_1$ are constants.


FIGURE 15-10 A Nonlinear Relationship


Without linearity, we must somehow transform the data for one or both
variables in order to display it as a linear model. A common method of transformation
uses logarithms. This logarithmic transformation makes the data linear in the log.
Whenever a plot such as Figure 15-10, or any other evidence, leads us to suspect a
nonlinear relationship, a logarithmic transformation may be required.
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A linear relationship assumes that for every one-unit change in X, Y changes by a 
constant amount. In our Hop Scotch Airlines example, for every one-unit increase in
advertising, passengers increased by 1.08 units. 

A curvilinear model assumes that Y changes by a different amount each time. 
Take, for example, a savings account earning a compound interest rate of 6 percent. 
The dollar value will increase by a larger amount each time period, since it earns 
6 percent of a larger base each time. This will generate an increasing function, as 
shown in Figure 15-10. It is called an increasing function because the additions to Y 
increase each time. In the event of a decreasing function, the additions to Y become 
smaller each time. We will examine each case.


An Increasing Function
For an increasing function, the rules of logarithms allow us to express Formula (15.4)
as


$\log \hat{Y} = \log b_0 + (\log b_1)X$ [15.5]


Already it is looking like a linear expression, in which $\log b_0$ is the intercept and $\log b_1$
is the slope. The values for $\log b_0$ and $\log b_1$ are found from the equations


$\log b_1 = \frac{n\sum(X \log Y) - (\sum X)(\sum \log Y)}{n\sum X^2 - (\sum X)^2}$ [15.6]


$\log b_0 = \frac{\sum \log Y}{n} - (\log b_1)\frac{\sum X}{n}$ [15.7]


After $\log \hat{Y}$ has been calculated, we use antilogs to get the estimate for Y. We can best
illustrate this with the use of time-series data. Time-series data are a collection of 
data over a series of time periods (days, months, years, etc.). Example 15-1 illustrates. 


EXAMPLE 


Case 


A common example of $\hat{Y} = b_0 b_1^X$ is found with many economic time series, which tend
to change by a certain percentage each time period. In this event, we are estimating a
trend line based on logarithms.


Monthly sales revenues in hundreds of dollars for the Black Jack Coal Company
were found to display the following values:

Revenues for Black Jack Coal Company

Month (X)   Revenues (Y)   log Y     $X^2$    X log Y    $(\log Y)^2$
    1            31        1.4914      1      1.4914     2.2243
    2            43        1.6335      4      3.2669     2.6683
    3            61        1.7853      9      5.3560     3.1873
    4            85        1.9294     16      7.7176     3.7226
    5           118        2.0719     25     10.3594     4.2928
    6           164        2.2148     36     13.2891     4.9053
    7           228        2.3579     49     16.5055     5.5597
    8           316        2.4997     64     19.9975     6.2485
    9           444        2.6435     81     23.7911     6.9881
   10           611        2.7860    100     27.8604     7.7618
Totals 55     2,101       21.4134    385    129.6349    47.5543


The CEO for Black Jack noticed that sales tended to increase by about the same
percentage each month. He calculated the percentage increase to be (43 - 31)/31 =
39 percent. A plot of the data revealed a pattern similar to that in Figure 15-10. He
therefore felt a logarithmic trend projection was called for, and has asked the
quantitative analysis section to develop the regression model.


SOLUTION: Beginning with


$\hat{Y} = b_0 b_1^X$
we have
$\log \hat{Y} = \log b_0 + (\log b_1)X$
$\log b_1 = \frac{10(129.6349) - (55)(21.4134)}{10(385) - (55)^2} = 0.1438$
$\log b_0 = \frac{21.4134}{10} - (0.1438)\frac{55}{10} = 1.35$
Therefore,


$\log \hat{Y} = 1.35 + 0.1438X$


Taking antilogs, we obtain
$\hat{Y} = 22.39(1.39)^X$
INTERPRETATION: It would appear that the CEO was correct. For every one-unit increase
in X, Y goes up by 1.39 - 1.00 = 0.39 = 39 percent. We find an estimate of sales for the
sixth time period by setting X equal to 6 and solving the equation


$\hat{Y} = 22.39(1.39)^6 = 22.39(7.213) = 161.5$


This projected value for sales closely approximates the actual sales level in the sixth
time period of 164.
An estimated value for sales in the twelfth period is


$\hat{Y} = 22.39(1.39)^{12} = 22.39(52.02) = 1,164.7$


The 39 percent by which Y increases is the average amount by which Y increased
each time period and is called the instantaneous rate of growth.
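
The semilog fit for Black Jack Coal can be checked with a few lines of code that apply Formulas (15.6) and (15.7) directly. This is a sketch: the function name is illustrative, and the tenth revenue value is taken as 611, the figure implied by the column total of 2,101. Because the code uses unrounded logarithms, its results may differ slightly from the hand calculation in later decimals.

```python
import math

months = list(range(1, 11))
# Revenues in hundreds of dollars; the last value (611) is inferred from the total.
revenues = [31, 43, 61, 85, 118, 164, 228, 316, 444, 611]


def fit_semilog(x, y):
    """Regress log10(Y) on X and return (log b0, log b1)."""
    n = len(x)
    log_y = [math.log10(v) for v in y]
    sum_x = sum(x)
    sum_log_y = sum(log_y)
    sum_x2 = sum(v ** 2 for v in x)
    sum_x_log_y = sum(a * b for a, b in zip(x, log_y))
    log_b1 = (n * sum_x_log_y - sum_x * sum_log_y) / (n * sum_x2 - sum_x ** 2)
    log_b0 = sum_log_y / n - log_b1 * sum_x / n
    return log_b0, log_b1


log_b0, log_b1 = fit_semilog(months, revenues)
b0, b1 = 10 ** log_b0, 10 ** log_b1  # antilogs
forecast_12 = b0 * b1 ** 12
print(f"Y-hat = {b0:.2f}({b1:.2f})^X, forecast for period 12 = {forecast_12:.1f}")
```

The fitted constants agree with the worked solution to two decimals. The forecast for the twelfth period differs somewhat from the hand calculation because the growth factor is not rounded to 1.39 before being raised to the twelfth power.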




Notice that log Y was regressed on X. No values for log X were used. This is called
a semilog (or log-linear) model, since logs were used for only one of the variables.
In addition, we find the correlation coefficient from the equation


$r = \frac{n\sum(X \log Y) - \sum X(\sum \log Y)}{\sqrt{[n\sum X^2 - (\sum X)^2][n\sum(\log Y)^2 - (\sum \log Y)^2]}}$ [15.8]


$= \frac{10(129.63) - 55(21.41)}{\sqrt{[10(385) - (55)^2][10(47.56) - (21.41)^2]}}$


Thus, our model is explaining 98 percent of the change in Black Jack's sales by using
time as the explanatory variable.


A Decreasing Function


Suppose a plot for the data revealed a pattern as in Figure 15-11. Notice that Y tends 
to increase by a smaller amount each time. It flattens out rather than becoming steeper, 
as for the Black Jack Coal Company. In this instance, we require a semilog model 
which regresses Y on log X. 


$\hat{Y} = b_0 + b_1 \log X$ [15.9]


Values for $b_0$ and $b_1$ are


$b_1 = \frac{n\sum(\log X)Y - (\sum \log X)(\sum Y)}{n\sum(\log X)^2 - (\sum \log X)^2}$ [15.10]


$b_0 = \frac{\sum Y - b_1\sum \log X}{n}$ [15.11]


FIGURE 15-11 A Decreasing Function


TABLE 15-3 Payments by Noah Fence

Year     X    Payments (Y)   Log X    $(\log X)^2$   (Log X)(Y)    $Y^2$
1989     1       2.10        0.000      0.000         0.000       4.410
1990     2       2.35        0.301      0.091         0.707       5.523
1991     3       2.71        0.477      0.228         1.293       7.344
1992     4       2.80        0.602      0.362         1.686       7.840
1993     5       2.90        0.699      0.489         2.027       8.410
Totals  15      12.86        2.079      1.170         5.713      33.53


Consider the payments, in thousands of dollars, by the Noah Fence company over 
a five-year period, as in Table 15-3. Unlike the case for Black Jack Coal, the time 
periods are given in years. We must therefore recode them in single units to yield the 
X-values shown in Column 2. The first year is assigned a value of 1, with each 
succeeding year given successive values. This recoding process is necessary since we 
cannot use the year 1989, for example, in our calculations. The year 1989 is not a 
number. Envision the need for recoding if the time periods had been given as May 
through September instead of 1989 through 1993. We certainly could not use the term
May in our calculations!


Then


$b_1 = \frac{(5)(5.713) - (2.079)(12.86)}{(5)(1.170) - (2.079)^2} = 1.197$

$b_0 = \frac{12.86 - (1.197)(2.079)}{5} = 2.07$

Then

$\hat{Y} = 2.07 + 1.197 \log X$

A forecast for 1997, which would be time period X = 9, is

$\hat{Y} = 2.07 + 1.197(\log 9)$

Since log 9 = 0.954, we have


$\hat{Y} = 2.07 + 1.197(0.954) = 3.21$


The estimated level of payments in 1997 is 3.21 thousand dollars.
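
The same calculation for the decreasing function can be sketched in code using Formulas (15.10) and (15.11). The function name is illustrative. The payment values are those read from Table 15-3, with the first and last entries (2.10 and 2.90) inferred from the column totals, so treat them as an assumption of this illustration.

```python
import math

periods = [1, 2, 3, 4, 5]                   # 1989 through 1993, recoded
payments = [2.10, 2.35, 2.71, 2.80, 2.90]   # thousands of dollars (partly inferred)


def fit_y_on_log_x(x, y):
    """Regress Y on log10(X) and return (b0, b1)."""
    n = len(x)
    log_x = [math.log10(v) for v in x]
    sum_log_x = sum(log_x)
    sum_y = sum(y)
    sum_log_x2 = sum(v ** 2 for v in log_x)
    sum_log_x_y = sum(a * b for a, b in zip(log_x, y))
    b1 = (n * sum_log_x_y - sum_log_x * sum_y) / (n * sum_log_x2 - sum_log_x ** 2)
    b0 = (sum_y - b1 * sum_log_x) / n
    return b0, b1


b0, b1 = fit_y_on_log_x(periods, payments)
forecast_1997 = b0 + b1 * math.log10(9)  # 1997 is time period X = 9
print(f"Y-hat = {b0:.2f} + {b1:.3f} log X, forecast = {forecast_1997:.2f}")
```

The coefficients and the forecast agree with the hand calculation to within rounding.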
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Furthermore, 


$r^2 = \frac{b_0\sum Y + b_1\sum(\log X)Y - n\bar{Y}^2}{\sum Y^2 - n\bar{Y}^2}$ [15.12]


$= \frac{(2.07)(12.86) + 1.197(5.713) - 5(2.572)^2}{33.53 - 5(2.572)^2}$


$= 0.843$


Other Possibilities


If a plot of the data produces a pattern similar to Figure 15-12(a), the model we used
for the Black Jack Coal Company, in which we regressed log Y on X, may be
appropriate. If the data produce the pattern in Figure 15-12(b), then the model in
which we regress Y on log X, as with the Noah Fence Company, may yield the best
results. In both cases, $b_1 < 0$.


FIGURE 15-12 Other Nonlinear Possibilities


Using Polynomial Models


In many cases, the functional relationship between X and Y can be expressed by a
polynomial model of some order higher than 1. For the examples cited, the appropriate
model is


$\hat{Y} = b_0 + b_1X + b_2X^2$


It uses as explanatory variables the values for X and $X^2$. It is called a second-order
polynomial because it contains more than one term in X on the right-hand side, the
highest power of which is 2. The general expression for such a polynomial function is


$Y = \beta_0 + \beta_1X + \beta_2X^2 + \beta_3X^3 + \cdots + \beta_mX^m + \epsilon$ [15.13]
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Since this involves multiple regression, computations should be done 
by computer. A computer run using SPSS-PC+ on the data for Black Jack Coal produces
the results


$\hat{Y} = 82.7 - 39.995X + 9.0227X^2$
$R^2 = 0.99$    $S_e = 20.0$


This compares with the results shown above, using the logarithmic model, of


$\hat{Y} = 22.39(1.39)^X$
$r^2 = 0.98$    $S_e = 0.004$
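
For readers without a statistical package, the second-order polynomial can be fitted by solving the least-squares normal equations directly. The sketch below does this with plain Gaussian elimination; the function name is illustrative, and the tenth revenue value is again taken as 611, the figure implied by the column total. It is meant to show the mechanics, not to replace a regression program.

```python
def fit_quadratic(x, y):
    """Least-squares fit of Y = b0 + b1*X + b2*X^2 via the normal equations."""
    rows = [[1.0, float(xi), float(xi) ** 2] for xi in x]
    # Normal equations: (X'X) b = X'y
    a = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    c = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(3)]
    # Forward elimination with partial pivoting
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        c[col], c[pivot] = c[pivot], c[col]
        for r in range(col + 1, 3):
            factor = a[r][col] / a[col][col]
            for k in range(col, 3):
                a[r][k] -= factor * a[col][k]
            c[r] -= factor * c[col]
    # Back substitution
    b = [0.0, 0.0, 0.0]
    for i in reversed(range(3)):
        b[i] = (c[i] - sum(a[i][k] * b[k] for k in range(i + 1, 3))) / a[i][i]
    return b


months = list(range(1, 11))
revenues = [31, 43, 61, 85, 118, 164, 228, 316, 444, 611]  # last value inferred
b0, b1, b2 = fit_quadratic(months, revenues)
print(f"Y-hat = {b0:.1f} + ({b1:.3f})X + {b2:.4f}X^2")
```

The coefficients should match those reported by the computer run above.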


It may be necessary to experiment with different functional forms in order to
determine which provides the best fit. In search of the optimal model, the results from
different logarithmic models may be compared with those obtained using polynomial
functions. The use of computers makes this comparison practical.

The results of such comparisons may, however, prove inconsistent. One model
may report a higher coefficient of determination than another (that's good) while
carrying a higher standard error of the estimate (that's bad). The question then
becomes, which model do you use?

The answer depends, at least in part, on the purpose for which the model is 
intended. If you wish to use the model to explain present values of Y and to understand 
why it behaves as it does, use the model with the higher coefficient of determination.
That is, if the intent is to explain, then the model with the higher explanatory value 
should be used. 

If, on the other hand, the purpose of the model is to predict future values of Y, use 
the model with the lower standard error of the estimate. If you want to predict, you 
would enjoy greater success with the model that generates the lower prediction error. 

However, such experimentation should be kept to a minimum. It is considered 
questionable, even unethical, to wildly experiment with one model and then another. 
You should know from the outset, given the nature of your research study, what 
procedure should be followed. The analogy is often made that to search blindly for the 
best model is similar to shooting the arrow and then drawing the target with the bull’s- 
eye at the spot where the arrow landed. 


COMPUTER APPLICATIONS


Much of the computer material relevant to this chapter was presented in the two previous
chapters dealing with regression and correlation. However, a few additional commands will
prove useful.


Durbin-Watson Statistic 


Minitab requires the command for regression and a subcommand for the Durbin-Watson. If
you want to regress data in Column 1 on two independent variables entered in Columns 2 and
3, type


MTB> REGRESS C1 ON 2 C2 C3;
SUBC> DW.
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SAS provides the Durbin-Watson if you request the option in the MODEL statement as 
PROC REG;
MODEL Y = X1 . . . Xk / DW;
Finally, the Durbin-Watson may be obtained using SPSS-PC by the commands
REGRESSION VARIABLES = Y X1 X2 X3
/DEPENDENT Y/METHOD ENTER/RESID DURBIN.


CBS automatically provides the Durbin-Watson when you select Number 10, Multiple 
Regression. You can obtain this even for a simple regression model.


Plots of Residuals


Minitab allows you to place the residuals in a column using the SUBC to the REGRESS 
command. If the data are entered in Columns 1 through 3, type 


MTB> REGRESS C1 ON 2 C2 C3; 


SUBC> RESIDUALS C4. 
MTB> DOTPLOT C4. 


SPSS-PC provides a residual plot with 


REGRESSION Y X1 X2/DEPENDENT Y/METHOD ENTER/CASEWISE ALL. 


This SPSS-PC command may be used with the one above for the Durbin-Watson. 
SAS-PC generates residuals with 


PROC GLM; 

MODEL Y = X1 X2;

OUTPUT OUT = NEW P = PREDICT R = RESIDUAL; 
PROC PLOT; 

PLOT RESIDUAL*Y/VREF = 0; 


The OUTPUT statement creates a new data set called NEW (any valid SAS filename can be 
used). The P = PREDICT and R = RESIDUAL create two variables: the predicted value for
Y and the residuals. The PLOT statement plots the variable RESIDUAL against the Y variable. 
The VREF = 0 places a horizontal line at zero on the vertical axis. 

CBS provides a plot of the residuals as an option you can select in the simple and multiple 
regression modules. 


SOLVED PROBLEMS


1. Predicting Expenditures  The Human Resources Cabinet for the state of Virginia wishes
to devise a model for consumer expenditures that might aid in establishing a welfare
system for the long-term unemployed. Consumption (Y) was regressed on number of
family members ($X_1$), whether the head of household was employed ($X_2$), and whether
children were present in the home ($X_3$). What would the model look like?
Here, $X_2$ and $X_3$ were dummy variables in which $X_2 = 1$ if unemployed and 0 if
employed, and $X_3 = 1$ if children were present and 0 if no children were present. The
results might appear as


$\hat{Y} = 10 + 9X_1 - 8X_2 + 7X_3$


Although the sizes of the coefficients are fictitious, note the negative sign of the
coefficient for $X_2$. Since $X_2$ was coded as 1 if unemployed, we might expect $b_2 < 0$. Why?
Because unemployment should tend to decrease consumption expenditures.
2. Testing for Autocorrelation  If 50 observations were included in the data set for
Problem 1, does autocorrelation exist at the 5 percent level if d = 2.97?
With k = 3 independent variables and n = 50, Table K yields critical values of $d_L$ =
1.42 and $d_U$ = 1.67. Then


  +AC  |  Test is inconclusive  |  No AC  |  Test is inconclusive  |  -AC
  d_L = 1.42    d_U = 1.67    4 - d_U = 2.33    4 - d_L = 2.58    d = 2.97


Yes, the results suggest negative AC.

3. A Cheeseburger in Paradise  Jose has a thriving business selling authentic plastic Aztec
relics to gullible American tourists in Mexico. His accounting records show that, over the
past few months, profits have been increasing and the number of hours he works each
week has been going down:


Profits ($100's)   Hours
     12.2            87
     17.9            85
     25.8            82
     37.0            78
     53.3            69
     78.8            56
    112.9            39


Jose wants to develop a regression model to predict profits, and another model to predict
hours.

a. A graph of the data for profits reveals an increasing function, thereby suggesting a
curvilinear model in which the log of profits is regressed on X. The computations
appear as

Profits (Y)    X    log Y    $X^2$    X log Y    $(\log Y)^2$
   12.2        1    1.09       1       1.09        1.18
   17.9        2    1.25       4       2.51        1.57
   25.8        3    1.41       9       4.23        1.99
   37.0        4    1.57      16       6.27        2.46
   53.3        5    1.73      25       8.63        2.98
   78.8        6    1.90      36      11.38        3.60
  112.9        7    2.05      49      14.37        4.21
Totals        28   10.99     140      48.48       17.99
$\log b_1 = \frac{(7)(48.48) - (28)(10.99)}{7(140) - (28)^2}$


$= 0.16076$

$\log b_0 = \frac{10.99}{7} - (0.16076)\frac{28}{7}$
$= 0.9276$
$\log \hat{Y} = 0.9276 + 0.16076X$
$\hat{Y} = 8.46(1.45)^X$

Furthermore,


$r = \frac{(7)(48.48) - (28)(10.99)}{\sqrt{[(7)(140) - (28)^2][(7)(17.994) - (10.99)^2]}}$


$= \frac{31.64}{31.86} = 0.9932$


The standard error is 0.0037.
The polynomial function reports as


$\hat{Y} = 18.1 - 6.75X + 2.859X^2$
$R^2 = 0.997$    $S_e = 2.77$


b. The graph for hours produces results like those in Figure 15-12(b), requiring that we
regress Y on log X.


Hours (Y)    X    log X    (log X)Y    $Y^2$     $(\log X)^2$
   87        1    0.00       0.00      7,569      0.00
   85        2    0.30      25.59      7,225      0.09
   82        3    0.48      39.12      6,724      0.23
   78        4    0.60      46.96      6,084      0.36
   69        5    0.70      48.23      4,761      0.49
   56        6    0.78      43.58      3,136      0.61
   39        7    0.85      32.96      1,521      0.71
  496       28    3.71     236.44     37,020      2.49

$b_1 = \frac{7(236.44) - (3.71)(496)}{7(2.49) - (3.71)^2}$
$= -50.49$
$b_0 = \frac{496 + (50.49)(3.71)}{7}$


$= 96.62$


Then


$\hat{Y} = 96.62 - 50.49(\log X)$
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Also, 
$r^2 = \frac{(96.62)(496) - (50.49)(236.44) - 7(70.86)^2}{37,020 - 7(70.86)^2}$
$= 0.675$

$S_e = 11.05$


The polynomial function is
$\hat{Y} = 82.29 + 5.179X - 1.607X^2$
$R^2 = 0.996$    $S_e = 1.32$


CHAPTER CHECKLIST
After studying this chapter, as a test of your knowledge of the essential points, can you


- Explain when dummy variables should be used?
- Define and give examples of dummy variables?
- Incorporate dummy variables into a model?
- Define, graphically illustrate, and give examples of autocorrelation?
- Interpret the Durbin-Watson statistic in terms of autocorrelation?
- Define, graphically illustrate, and give examples of heteroscedasticity?
- Explain when a curvilinear model is required?
- Compute and interpret a curvilinear model?


LIST OF SYMBOLS AND TERMS
Dummy Variable      A qualitative variable that is assigned a coded value of 0 or 1.
Autocorrelation     Correlation between error terms in violation of the principles of OLS.
Heteroscedasticity  A violation of OLS principles in which error terms do not have the
                    same variance.


LIST OF FORMULAS

[15.1]  $e_t = \rho e_{t-1} + \mu_t$
        This model for successive error terms is useful in detecting autocorrelation.

[15.2]  $d = \frac{\sum (e_t - e_{t-1})^2}{\sum e_t^2}$
        The Durbin-Watson statistic measures the presence of autocorrelation.

[15.3]  $d = 2(1 - r)$
        Quick method to estimate the Durbin-Watson.

[15.4]  $\hat{Y} = b_0 b_1^X$
        Functional form used to estimate an increasing function.

[15.5]  $\log \hat{Y} = \log b_0 + (\log b_1)X$
        Logarithmic transformation used to estimate (15.4).

[15.6]  $\log b_1 = \frac{n\sum(X \log Y) - (\sum X)(\sum \log Y)}{n\sum X^2 - (\sum X)^2}$
        Used to estimate the coefficient in a nonlinear relationship.

[15.7]  $\log b_0 = \frac{\sum \log Y}{n} - (\log b_1)\frac{\sum X}{n}$
        Used to estimate the intercept in a nonlinear relationship.

[15.8]  $r = \frac{n\sum(X \log Y) - \sum X(\sum \log Y)}{\sqrt{[n\sum X^2 - (\sum X)^2][n\sum(\log Y)^2 - (\sum \log Y)^2]}}$
        The correlation coefficient measures the strength in a nonlinear relationship.

[15.9]  $\hat{Y} = b_0 + b_1 \log X$
        Functional form for a decreasing function.

[15.10] $b_1 = \frac{n\sum(\log X)Y - (\sum \log X)(\sum Y)}{n\sum(\log X)^2 - (\sum \log X)^2}$
        Slope coefficient for a decreasing function.

[15.12] $r^2 = \frac{b_0\sum Y + b_1\sum(\log X)Y - n\bar{Y}^2}{\sum Y^2 - n\bar{Y}^2}$
        Coefficient of determination for a decreasing function.


EXERCISES


You Make the Decision


1. A coal firm wants to set up a regression model to predict output (Y) that encompasses as
explanatory variables hours of labor input ($X_1$) and whether a labor strike occurred during
the time period under study ($X_2$). Devise the model and explain.

2. Given the model in the previous problem, should $b_2$ be positive or negative? Explain.

3. Wild Willie's Wennie World is testing a model to measure profits. It contains 5
explanatory variables and 25 observations. At the 1 percent level, does autocorrelation
exist if the Durbin-Watson is 2.37?

4. Above what value must the Durbin-Watson be in the previous problem before negative
autocorrelation is suspected?

5. Below what value must the Durbin-Watson be before positive autocorrelation is present
in Problem 3?

6. What is meant by positive autocorrelation? By negative autocorrelation? Explain with the
aid of graphs.

7. State what values you would assign to dummy variables to measure a person's race if the
categories included (1) white, (2) black, (3) Oriental, and (4) other.

8. Students at the Cosmopolitan School of Cosmetics are taught to encode data on hair color
as 1 if blond, 2 if redhead, and 3 if other. Comment. What would you advise?
Problems
9. The manager of a local accounting firm created a regression model for the length of time
it takes to complete an audit. The model was

$\hat{Y} = 17 - 1.41X_1 + 1.73X_2$


where $\hat{Y}$ is time in hours
      $X_1$ is years of experience of auditor
      $X_2$ is whether auditor is a CPA: 0 if yes, 1 if no


a. Interpret the coefficient for $X_2$.
b. Would you expect $b_2$ to be positive? Explain.
c. If the auditor has seven years of experience and is a CPA, how long would it take to
complete the audit according to the model?
d. If another auditor also has seven years of experience but is not a CPA, how long would
it take to complete the audit according to the model?
10. If the dummy variable in the previous problem was 1 if CPA, 0 if not CPA, what would
you expect the sign of $b_2$ to be? Explain.
11. A marketing representative establishes a regression equation for units sold based on the
population in the sales district and whether the district has a home office to which sales
personnel report. The model proves to be


$\hat{Y} = 78.12 + 1.01X_1 - 17.2X_2$


where $\hat{Y}$ is units sold
      $X_1$ is population in thousands
      $X_2$ is 0 if district contains an office, 1 if it does not


a. Interpret $b_2 = -17.2$.
b. How would you compare the slopes and the coefficient of the two regression lines
provided by this model? Compute and compare the two regression formulas.
c. Draw a graph to illustrate.

12. Considering the previous problem, if population is 17,000 in a district containing an
office and 17,000 in a district without an office, what would the number of units sold in
each one be? Draw a graph to illustrate.

13. Studies have shown that in states with more liberal regulations concerning the receipt of
unemployment compensation, unemployment rates are higher. If a regression model for
unemployment rates incorporates a dummy variable, coded 1 if regulations are liberal and
0 if otherwise, would its coefficient be greater than or less than zero according to these
studies? Explain.

14. A model to predict income is to contain a dummy variable for a person's highest
education level consisting of (1) high school degree, (2) undergraduate degree, (3) master's
degree, (4) degree above master's. Another dummy variable for race that provides for (1)
white, (2) black, and (3) other is to be included.

a. How many dummy variables are necessary?
b. Devise one possible coding system. Are there others? Explain.

15. An employment agency is considering a model that uses dummy variables to measure an
applicant's ability to hold a job. The current proposal is to create a dummy variable to
denote that the applicant has work experience, and another dummy variable to indicate
that the applicant has no work experience. Comment on this approach.

16. A model to predict receipts on insurance claims by policyholders suffering a loss contains
only one variable, a dummy variable coded 0 if male and 1 if female. For a male
policyholder, what would $\hat{Y}$ be? What would the regression line look like?
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17. The data shown were collected by Waldo's Wacky World of Wall Coverings on profit per 
sales (Y), square footage of area to be covered (X,), and whether the customer is a 
commercial business (C), private homeowner (H), or governmental unit (G). 


Observation   Profit (Y)     Area (X)   Type
     1           408          1,400       H
     2           817          2,200       G
     3           502          1,500       C
     4           315            800       C
     5           782          1,000       H
     6           789          1,200       G
     7           604          1,100       C
     8           592            900       H
     9           801          1,600       G
    10           732          1,400       G
    11       [illegible]      1,300       H
    12           612          1,100       C
    13           831          1,600       G


a. Construct the data set with the proper dummy variables. Indicate the coding system
used.
b. Solve for the regression model and interpret the results.
c. Draw a line for each type of customer.
d. Is autocorrelation present at the 1 percent level?

18. According to Business Week, there appears to be a significant difference in the nature of
hostile takeovers of firms, depending on the region of the country in which the firm has its
home office. Use the data shown for size of firm in annual sales (1,000's) (Y), price/
earnings ratio of the firm taken over ($X_1$), and whether the firm's office is located in the
North (N), South (S), Midwest (M), or West (W).


Observation   Size (Y)   P/E ($X_1$)    Location
     1          120         2.02           N
     2          320         3.12           S
     3          502      [illegible]       W
     4          478         0.89           N
     5          389         2.89           S
     6          565         2.42           W
     7          317         1.29           N
     8          488         1.89           S
     9          532         1.98           M
    10          619         2.01           M


a. Construct the data set with the proper dummy variables. Indicate the coding system
used.
b. Solve for the regression model and interpret your results.
c. Draw a line for each regional location.
d. Is autocorrelation present at the 5 percent level?

19. Given the data in the previous problem, test the coefficient for each dummy variable to 
determine if it is significantly different from zero at the 5 percent level. 

20. Use White's method to test the data in Problem 18 for heteroscedasticity at α = 0.05. 

21. The data shown here were collected to explain salary levels for workers at a local plant. 

Salary ($1,000's)    Years of Education    Sex 
      42.2                   8              M 
      58.9                  12              M 
      98.8                  16              M 
      23.5                   6              F 
      12.5                   5              M 
      67.8                  12              M 
      51.9                  10              F 
      81.6                  14              F 
      61.0                  12              F 

a. Compute the regression model, using a computer. 
b. Is there evidence of sex discrimination in salary levels? 
c. Is education useful in explaining salary? 
d. Are autocorrelation and heteroscedasticity problems? 
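
Questions about autocorrelation, such as part d above, are commonly approached through the Durbin-Watson statistic, d = Σ(e_t − e_{t−1})² / Σe_t², computed from the regression residuals. The sketch below assumes that this is the test intended; the residual series are hypothetical and the function name is illustrative only.

```python
def durbin_watson(residuals):
    """Durbin-Watson d: squared successive differences over squared residuals."""
    numerator = sum(
        (residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals))
    )
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator


# Hypothetical residuals (NOT from any exercise in the text).
alternating = [1.0, -1.0, 1.0, -1.0]   # signs flip every period
clustered = [1.0, 1.0, -1.0, -1.0]     # like signs follow one another

d_alternating = durbin_watson(alternating)
d_clustered = durbin_watson(clustered)

print("alternating residuals: d =", d_alternating)
print("clustered residuals:   d =", d_clustered)
```

Values of d near 2 suggest no first-order autocorrelation, values well below 2 suggest positive autocorrelation, and values well above 2 suggest negative autocorrelation. The formal decision at a stated significance level still requires the tabled lower and upper critical values for the given sample size and number of explanatory variables.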

22. Twenty stores are examined to determine the effect floor space (FS) and location (L) have 
on profits. Location is a dummy variable, encoded with a 1 if urban and 0 if suburban. 
The t-values are given in parentheses. The results are 

$$\hat{P} = 343 + \underset{(0.9)}{23.2}\,FS - \underset{(-1.6)}{12}\,L$$ 

Is location important at the 5 percent level? Explain. 

23. Twenty-five other stores are surveyed to examine the effect on revenue of local population 
(POP) and type of management (MAN). This last variable is a dummy variable 
encoded with a 1 if the manager owns the store and a 0 if he or she leases it. The results 
are shown here. The standard errors of the regression coefficients are shown in brackets. 

$$\hat{R} = 34 + \underset{[27.2]}{344}\,POP + \underset{[63.9]}{71.1}\,MAN$$ 

Is either POP or MAN significant at the 1 percent level? Explain. 
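
When standard errors rather than t-values are reported, as in the problem above, each t-ratio is the coefficient divided by its standard error, and it is compared with a critical value from the t-table with n − k − 1 degrees of freedom. The sketch below uses made-up numbers rather than those of the exercise; the function names are hypothetical, and the critical value is one the reader looks up in a t-table and supplies.

```python
def degrees_of_freedom(n, k):
    """Residual degrees of freedom for n observations and k explanatory variables."""
    return n - k - 1


def t_ratio(coefficient, standard_error):
    """t-ratio for testing that a regression coefficient is zero."""
    return coefficient / standard_error


def is_significant(t_value, critical_t):
    """Two-tailed decision: reject the null if |t| exceeds the critical value."""
    return abs(t_value) > critical_t


# Hypothetical example (NOT the exercise's numbers): 30 observations, 2 regressors.
df = degrees_of_freedom(n=30, k=2)
t = t_ratio(coefficient=5.0, standard_error=2.0)

# 2.052 is the two-tailed 5 percent value for 27 d.f., read from a t-table.
critical = 2.052
print("d.f. =", df, " t =", t, " significant:", is_significant(t, critical))
```

The same comparison applies to a dummy coefficient as to any other: a dummy whose t-ratio falls short of the critical value provides no evidence that the two categories differ.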

24. You have just run a model regressing employee retention (in years) on age at hiring and 
gender, encoding the dummy variable for gender as 1 if male and 0 if female. The results 
were 

$$\hat{Y} = 3.2 + 0.65\,AGE - 1.3\,GENDER$$ 

a. What is the formula for male? For female? 
b. You then realize that you meant to encode 1 if female and 0 if male. What will the 
   equation become? 
c. Now what is the formula for male? For female? 
d. Using the formula seen above, what is the estimate of years of retention if a male is 
   hired at age 23? What is it using your revised formula? 
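
Reversing the 0/1 coding of a dummy variable changes how the equation is written but not what it predicts. The sketch below illustrates this with made-up coefficients rather than those of the exercise; `recode_dummy` and `predict` are hypothetical helper names. The new intercept absorbs the old dummy coefficient, and the dummy coefficient changes sign.

```python
def recode_dummy(intercept, dummy_coef):
    """Coefficients after swapping the 0 and 1 codes of a dummy variable.

    The group formerly coded 1 becomes the base group, so its shift moves
    into the intercept and the dummy coefficient reverses sign.
    """
    return intercept + dummy_coef, -dummy_coef


def predict(intercept, slope, dummy_coef, x, dummy):
    """Fitted value from a model with one quantitative variable and one dummy."""
    return intercept + slope * x + dummy_coef * dummy


# Hypothetical fitted model (NOT the exercise's numbers); dummy: 1 = male, 0 = female.
b0, b1, b2 = 2.0, 0.5, -1.0

new_b0, new_b2 = recode_dummy(b0, b2)   # now 1 = female, 0 = male

# The same male, age 30, under each coding scheme.
old_male = predict(b0, b1, b2, x=30, dummy=1)
new_male = predict(new_b0, b1, new_b2, x=30, dummy=0)

print("recoded intercept and dummy coefficient:", new_b0, new_b2)
print("male prediction, old coding:", old_male, " new coding:", new_male)
```

The slope on the quantitative variable is unaffected by the recoding, which is why only two coefficients need to be adjusted.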




Chapter Fifteen Dummy Variables and Residual Analysis 


Empirical Exercises 


25. Are the number of class hours in which a student is enrolled this term and his or her class 
(freshman, sophomore, junior, or senior) good predictors of the number of hours a student 
spends studying each week? Obtain a sample of your fellow students to determine the 
usefulness of such a model. 

26. Obtain a sample of at least 30 stocks traded on the New York Stock Exchange. Does it 
appear that the trading volume for a stock can be predicted by (1) the difference between 
the high and low for the day and (2) whether the stock closed up, closed down, or 
remained unchanged? 


Computer Exercises 


27. Access the file FIRMS. It contains data on 50 firms with observations on profit levels 
measured in dollars, whether the firm is private, public, or quasi-public, and whether it is 
profit oriented or nonprofit oriented. Does it appear that these data can be used to explain 
the profitability of business concerns? Interpret your results. 

28. From the results of your computer run on FIRMS in the exercise above, determine how 
the model might be improved. Run the program with these changes and compare the 
results to your first run. 

29. Given the data in FIRMS, does there appear to be a problem with multicollinearity or 
autocorrelation? 


1. Holly Wood is a district manager for a chain of video stores that rents movies to the 
general public. The past year has seen a considerable fluctuation in revenues. Ms. Wood 
would like to identify those forces that explain rentals. She has collected data for 15 stores 
for the number of movies rented during a given week, population in hundreds within a 
2-mile radius of the store, whether a movie theater is within a 2-mile radius of the store, 
and whether the store is located in a city (C), suburban (S), or rural (R) setting. T indicates 
a movie theater is within the 2-mile radius; T' indicates that one is not. 


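Observation    Movies    Population     Theater        Setting 
     1          407         110       [illegible]         C 
     2          809         420            T              C 
     3          511         150       [illegible]         R 
     4          308         101            T              S 
     5          417         112            T              R 
     6          540         132       [illegible]         S 
     7          718         380            T              S 
     8          580         155            T         [illegible] 
     9          798         410       [illegible]         C 
    10          884         450            T         [illegible] 
    11          693         312       [illegible]         R 
    12          782         389       [illegible]         R 
    13          619         301       [illegible]    [illegible] 
    14          592         289       [illegible]    [illegible] 
    15          667         232       [illegible]    [illegible] 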

In addition to wondering how well her model fits the data, Ms. Wood is also 
concerned about the presence of autocorrelation and heteroscedasticity. How can you help? 

2. In a dispute among secondary teachers in Shelby County, Tennessee, charges of sex and 
age discrimination have been leveled at the county school board. These charges have been 
brought by teachers who feel their salaries do not reflect their level of experience or who 
believe women are paid less than men. Robert Gay, a history teacher with 17 years of 
experience, is particularly distressed by his salary of $25,500. 

The county school superintendent, alarmed by these allegations, prepares to investigate. 
He randomly selects 25 teachers, 10 males and 15 females, noting their years of 
experience, salary, and gender. The results are 


Teacher    Gender    Experience (yr)    Salary ($1,000's) 
   1         M            10                 24.8 
   2         F             8              [illegible] 
   3         F             7                 14 
   4         M            12                 17.4 
   5         F            15                 18.9 
   6         M            21                 21 
   7         M             2                 19.1 
   8         F            10                 16.5 
   9         F             8                 15.2 
  10         F            19                 19.3 
  11         F             7                 10.5 
  12         F            12                 15.5 
  13         M            15                 15.7 
  14         F            12                 15 
  15         M            17                 25.5 
  16         F             9                 16 
  17         F             7                 17.4 
  18         M            12                 17.4 
  19         F             8                 1 
  20         M            10                 11.5 
  21         F            14                 12.2 
  22         F             3                 10 
  23         F            12                 7 
  24         M            10                 7 
  25         M             9                 7 


The superintendent must now respond to the teachers' charges, including that by Mr. 
Gay. How might he proceed? Would the standard error of the estimate be useful? How 
about r² or the slope coefficient? 

3. Owners of Mighty Muscles, Inc., a health club in Hoboken, New Jersey, are trying to 
increase membership at the club. After reviewing operations at other health clubs in the 
area, Harry Hunk, the manager at Mighty Muscles, feels that there might be a relationship 
between the number of members in a health club and several explanatory variables, 
including (1) monthly dues, (2) local population, and (3) type of weight-training equipment 
in the club. Three kinds of weight machines are commonly found in health clubs: 
MuscleUp (MU), Iron Man (IM), and Studly Dude (SD). Mr. Hunk collects data for these 
variables for 20 health clubs similar to Mighty Muscles. Analyze the data and determine 
how well the model fits the data. Interpret your findings and recommend to Harry the 
policy decisions he should make for the club. 


Membership     Dues      Population (in 1,000's)    Equipment Type 
    45        $22.50            39.7                     MU 
   387         21.00            43.3                     MU 
   412         20.00            47.7                     MU 
   434         20.00            48.9                     SD 
   467         18.50             5.5                     SD 
   490         18.00            55.7                     SD 
   612         15.00            62.1                     SD 
   523         17.50            56.7                     IM 
   545         17.00            58.9                     IM 
   567         16.50            60.0                     IM 
   578         16.00            61.2                     SD 
   599         15.00            63.6                     SD 
   634         14.50            60.0                     IM 
   656         13.00            61.0                     SD 
   669         13.10            61.4                     MU 
   134         30.00            23.5                     MU 
   243         27.00            27.8                     IM 
   259         25.50            32.1                     M 
   278         23.00            37.3                     SD 
   321         24.00            35.8                     SD 